CloudDB: A Data Store for all Sizes in the Cloud




CloudDB: A Data Store for all Sizes in the Cloud
Hakan Hacigumus
Data Management Research, NEC Laboratories America
http://www.nec-labs.com/dm
www.nec-labs.com

What I will try to cover
- Historical perspective and motivation
- (Preliminary) Technical Approach
- Current Status
- Food for Thought
NEC Labs Data Management Research

Why Data Management Research?
- Many data management technologies and products have been around
- Data centers have evolved over time
- Data center hosting became a business
- The database community was successful in creating technologies and business

Why Data Management (Again)?
[Figure: the (Good Old) Database surrounded by new pressures]
- Amount of Data: the amount of business data doubles every 12-18 months
- New Data Types: relational databases manage only 10-15% of the available data
- New Type of Apps
- New Data Sources: individual users via Web 2.0 applications, social sites, collaboration, mobile devices, sensors, etc.
- New Usage Patterns: highly integrated, extremely data intensive
- Large Number of Users: unprecedented increase and fluctuations; around the clock, around the world, highly interconnected

Cloud Computing
- A paradigm shift in how and where a workload is generated and executed
[Figure: cloud service consumer, cloud service provider, and the cloud provider API]
- Market size: Data Management Market ~$20B; IT Cloud Service ~$42B (by 2012) (IDC)

Animoto on Amazon EC2
- A no-infrastructure startup
- Biggest piece of hardware: a (fancy) espresso machine!
- Rapid growth: in three days, the number of users increased from 25k to 250k
- Number of servers grew from 50 to 3,500
- Assuming $500 per machine, that is $1.75M! Instead, they used Amazon EC2
- Problem: it is not trivial to distribute users' accesses to the data by just scaling out cloud computing nodes

Database-as-a-Service?
- ICDE 2002!
- Reaction: Cool, but
  - Technology
  - Business Model
  - Regulations
  - Psychological Acceptance

Data Management in Cloud
- The cloud computing model may provide a platform to address new challenges
- But the problem is: data management systems were not designed and implemented with the cloud computing model in mind
- So the question is: what are the data management challenges we need to address before the full potential of cloud computing can be realized?

Need for New Solutions
- Massive scalability to handle
  - Very large amounts of data
  - Very large numbers of diverse users/requests
- Elasticity to handle varying demand and optimize operating costs
- Flexibility to handle different data and processing models
- Massive multi-tenancy to achieve economies of scale
- More intelligent system monitoring and management

Cloud Data Management Challenges
[Figure: application classes placed along three dimensions: data scalability (# of records / query), query scalability (# of queries / sec), and multi-tenancy]
- Large analytic apps (OLAP): key challenge is scalable scan and aggregation
- Small apps: key challenge is scalable multitenant hosting
- Large transactional apps (OLTP): key challenge is scalable read/write
- Ultimate goal (CloudDB): key challenge is seamless data management

Buy All Sizes?? NO!
[Figure: OLAP and OLTP]

Buy One Size?
[Figure: OLAP and OLTP]

Let Someone Else Do All That
[Figure: Access and Management in front of OLAP and OLTP]

Let Someone Else Do All That
[Figure: Access and Management in front of OLAP and OLTP]
- Leveraging very specialized database technologies
- Easier integration with applications
- Easier adoption by developers (dominant force for adoption of cloud!)
- Easier and more flexible deployment options in the middleware

Wish Lists
Clients
- Standard language API (e.g., SQL)
- Identifiable and verifiable Service Level Agreements (SLAs)
- Common DBMS maintenance tasks (e.g., backup, versioning, patching)
- Availability of value-add services, such as business analytics, information sharing, collaboration, etc.
Service Provider
- Satisfying clients' SLAs to sustain revenue
- Great cost efficiency via a high level of automation and resource sharing to ensure profitability
- Maintaining an extendable platform for value-add services

(Some) Storage Models
Relational
- Main purpose: transaction processing
- Pro: standardization; higher performance on Online Transaction Processing (OLTP); ACID properties
- Con: scalability
Key/Value
- Main purpose: scalable data storage; read/write intensive workload
- Pro: scalability
- Con: standardization; performance issues; complex query capability; ACID properties (?)
Column-Oriented
- Main purpose: analytics processing; read optimized, throughput oriented
- Pro: higher performance on Online Analytical Processing (OLAP); more flexible schema evolution (?)
- Con: standardization; complex query capability

Application Scenario
[Figure: Application v1 (Personal Profile Management) and Application v2 (Information Portal), backed by a relational database and a key/value store. Data shown: profile data (address, phone, notes, contacts, calendars, reminders), portal data (online shopping, catalogs, product reviews, subscriptions), User 1 Data, User 2 Data, ..., products, reviews, and external sources]
- Very difficult migration for
  - Application developers (skills, time)
  - Architects (redesign)
  - Company (investment)

Data Model Decisions
- Problem: users are forced to decide on the data model based on the current needs of their applications
  - Is it possible to make the right decision all the time?
- Problem: the developer (client) has to re-architect the application in order to take advantage of different data models
  - How easy is it to change the architecture and the implementation?
[Figure: as the workload (# of queries / sec) evolves across application versions 1.0 to 4.0, the data tier moves from a single RDBMS to clustering, to sharding, to a key-value store]

Remember Data Independence?
1968, 1970

Data Independence
- Decouple application logic from data processing
- Let them be optimized and managed independently
- Enabled decades of innovation and improvement in databases

Data Independence
- The application should not have to be aware of the physical organization of the data (and how it can be accessed)
- All it needs is a logical (declarative) specification
- CloudDB makes decisions based on application context, workload characteristics, etc.
[Figure: CloudDB as a layer for data independence. The application performs data load and query/update through a SQL API; beneath the layer sit a relational store, an analytics store, and a key/value store, with # of queries / sec as the scale axis]
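To make the idea of a data independence layer concrete, here is a minimal sketch in Python. It is not CloudDB's implementation: the class names, the routing rule, and the 10,000 queries/sec threshold are all assumptions chosen for illustration. The only point is that the application submits declarative SQL and never names a store.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Stand-in for a specialized backend (hypothetical, not a CloudDB component)."""
    name: str
    executed: list = field(default_factory=list)

    def execute(self, sql: str) -> str:
        self.executed.append(sql)
        return f"{self.name} handled: {sql}"


class DataIndependenceLayer:
    """Accepts declarative SQL and hides which store runs it."""

    def __init__(self) -> None:
        self.stores = {
            "relational": Store("relational"),
            "analytics": Store("analytics"),
            "key_value": Store("key_value"),
        }

    def choose_store(self, sql: str, queries_per_sec: float) -> str:
        # Toy routing rule, purely an assumption for illustration:
        # aggregations go to the analytics store, very high request rates
        # go to the key-value store, everything else stays relational.
        text = sql.upper()
        if "GROUP BY" in text or "SUM(" in text or "COUNT(" in text:
            return "analytics"
        if queries_per_sec > 10_000:
            return "key_value"
        return "relational"

    def execute(self, sql: str, queries_per_sec: float = 0.0) -> str:
        store = self.stores[self.choose_store(sql, queries_per_sec)]
        return store.execute(sql)
```

The application code would call only `layer.execute(sql, ...)`; changing the routing rule or adding a store would not require touching the application, which is the property the slide asks for.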

Language?
- New breed databases: CouchDB, Project Voldemort (Dynamo), Cassandra, BigTable, Tokyo Cabinet, MongoDB, SimpleDB, ...
- MapReduce/Hadoop

Some Reminders about SQL
- By far the most widely used data access language
- It has nothing to do with
  - How the data is stored
  - How the queries are executed
  - How the transactions are handled
- Very large number of skilled programmers
- Huge number of existing applications and tools

SQL is actually good?
- HIVE: SQL API on top of MapReduce
- Google BigQuery: SQL over data stored in non-relational databases

CloudDB - Guiding Principles
- Embrace heterogeneity
  - One size does not fit all
  - Leverage specialized technologies
- Maintain and restore the declarative nature of data processing
- Understand and define the dimensions of scalability

CloudDB Middleware: Opaque vs. Transparent Data Stores
[Figure: applications send SQL queries and transaction patterns to the CloudDB middleware (API/language support (SQL), distributed query processor) and receive results; labels: opaque, transparent, consistency / scalability]
- System independence?
- The middleware would be responsible for making all the decisions regarding the choice of data stores, processing the queries, and end-to-end system optimization
- While the middleware can abstract away the underlying storage systems, it should explicitly express certain essential aspects of the system, such as consistency levels and scalability of transactions

CloudDB Platform
[Figure: platform architecture]
- (External) applications send SQL queries and receive results through the API/language support layer (JDBC, SQL)
- Intelligent Cloud Database Coordinator (ICDC), with components labeled Design Workload Optimizer Analysis, Multi Tenancy Manager (MTM), Cluster Controller, Client SLAs, Capacity Planner, System Monitor, and Database
- Distributed Query Processor with an SLA Aware Dispatcher and one Scheduler per store
- Relational Store: Auto Replication, Internal Query Processing, Auto Partitioning
- Key-Value Store: Internal Query Processing, Auto Sharding
- Analytics Store: Auto Replication, Internal Query Processing, Auto Partitioning
- Data Migration (CloudDB Store)

CloudDB Platform: Key Points
[Figure: the CloudDB platform architecture, annotated with three key points]
- One unified, standard API
- Intelligent analysis and decision making
- Specialized stores for specific needs

Our Data Management Platform: Key Research Areas
[Figure: the CloudDB platform architecture, annotated with research areas and key points]
- Research areas: Intelligent Management, Workload Management, Data Stores
- Key points: one unified, standard API; intelligent analysis and decision making; specialized stores for specific needs

CloudDB System Architecture: Microsharding is a part of CloudDB
[Figure: the CloudDB platform architecture with Microsharding highlighted]

SQL over Key-Value Stores
- Microsharding to enable SQL over key-value stores
- Key challenge: limited access capabilities (only key-based put/get)
[Figure: applications issue SQL to query execution nodes (relational middleware, a pool of servers), which use key access to reach storage nodes (storage cloud: a key-value store on a pool of servers)]

Microsharding
- Key-value stores are good at scaling write-intensive workloads
- But they don't leverage a large body of technologies developed in databases over the decades, such as:
  - Relationships
  - Transactions
  - Advanced query functions, etc.
- These are hand-coded by developers
- Microsharding aims at bringing those capabilities into key-value stores in a principled way
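As an illustration of what "hand-coded by developers" can look like, the sketch below maintains a users-to-reviews relationship on top of a store that offers only put/get. The in-memory dictionary, the key naming scheme (`user:1`, `review:10`, `user_reviews:1`), and the sample rows are all assumptions made up for this example.

```python
# In-memory dict standing in for a key-value store with only put/get.
kv = {}


def put(key, value):
    kv[key] = value


def get(key):
    return kv.get(key)


# The developer invents a key scheme and maintains the relationship by hand.
put("user:1", {"id": 1, "name": "alice"})
put("review:10", {"id": 10, "user_id": 1, "text": "great"})
put("review:11", {"id": 11, "user_id": 1, "text": "ok"})
put("user_reviews:1", [10, 11])  # hand-maintained index from user to reviews


def reviews_for_user(user_id):
    """Hand-coded equivalent of a users-reviews join for one user."""
    user = get(f"user:{user_id}")
    if user is None:
        return []
    review_ids = get(f"user_reviews:{user_id}") or []
    return [(user, get(f"review:{rid}")) for rid in review_ids]
```

Every application that needs this relationship has to repeat the same bookkeeping, and nothing keeps the index and the review objects consistent if one write fails. That is the gap the slide says microsharding aims to close.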

Key Technical Questions Addressed
- How can we map relational schemas to key-value store data models?
- How can we map relational tuples to key-value objects?
- Once we have those mappings, how can we define transaction classes that can be supported in a scalable way in key-value stores?
- What are the system implementation issues with such a middleware?

Query and Data Transformation
Physical design: mapping between relational data and K/V data
Schema (+data):
  TABLE users ( id: primary key )
  TABLE reviews ( id: primary key, user_id: foreign key to users )
Physical design:
  NEST reviews BY user_id
Query (template):
  SELECT * FROM users, reviews WHERE users.id = reviews.user_id AND users.id = ?
Transformed data (KV data): microshard User[Review], one users row with its reviews rows nested inside
Query plan: GET the microshard, then UNNEST
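The transformation on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not the CloudDB code: the sample rows, the function names, and the use of a plain dictionary as the key-value store are hypothetical. `nest_reviews_by_user` plays the role of the physical design step (NEST reviews BY user_id), and `query_user_with_reviews` plays the role of the GET plus UNNEST query plan for the query template.

```python
from collections import defaultdict

# Hypothetical sample data for the two relations on the slide.
users = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
reviews = [
    {"id": 10, "user_id": 1, "text": "great"},
    {"id": 11, "user_id": 1, "text": "ok"},
    {"id": 12, "user_id": 2, "text": "poor"},
]


def nest_reviews_by_user(users, reviews):
    """Physical design: NEST reviews BY user_id, one KV object per user."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review["user_id"]].append(review)
    # The key of the root relation (users.id) becomes the key-value key.
    return {
        u["id"]: {"user": u, "reviews": grouped.get(u["id"], [])}
        for u in users
    }


def query_user_with_reviews(kv_store, user_id):
    """Query plan: GET the microshard by key, then UNNEST into joined rows."""
    shard = kv_store.get(user_id)  # GET
    if shard is None:
        return []
    rows = []
    for review in shard["reviews"]:  # UNNEST
        row = {f"users.{k}": v for k, v in shard["user"].items()}
        row.update({f"reviews.{k}": v for k, v in review.items()})
        rows.append(row)
    return rows
```

For example, `query_user_with_reviews(nest_reviews_by_user(users, reviews), 1)` needs a single key lookup to produce the same rows as the SQL join for `users.id = 1`, which is why the nested layout suits a store that supports only key-based access.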

Microsharding
A microshard
- is a logical unit of data
- is a principled way to shard a database into small fragments
- is a unit of transactional data access
- is accessed by its key, the key of the root relation
[Figure: microshards with Key = 1, Key = 2, Key = 3, ..., Key = N; two transactions on Users key = 1, one on Users key = 2, and one on Users key = 3, each accessing the microshard with the matching key]
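One way to picture "a unit of transactional data access" is a read-modify-write that touches exactly one microshard and commits with a version check. The sketch below uses optimistic compare-and-set, which is an assumption of this example: the slides do not say which concurrency control mechanism CloudDB uses, and the class and function names are hypothetical.

```python
import copy


class MicroshardStore:
    """Toy key-value store holding one versioned microshard per root key."""

    def __init__(self):
        self._data = {}  # root key -> (version, microshard)

    def get(self, key):
        version, shard = self._data.get(key, (0, None))
        return version, copy.deepcopy(shard)

    def put_if_version(self, key, expected_version, shard):
        """Write only if nobody committed since we read (compare-and-set)."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            return False
        self._data[key] = (current_version + 1, copy.deepcopy(shard))
        return True


def run_transaction(store, key, update, max_retries=3):
    """Apply update(shard) atomically to the single microshard at key."""
    for _ in range(max_retries):
        version, shard = store.get(key)
        new_shard = update(shard)
        if store.put_if_version(key, version, new_shard):
            return True
    return False
```

Because each transaction names one key, two transactions on Users key = 1 conflict with each other, while a transaction on Users key = 2 proceeds independently on whichever node holds that microshard.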

Isolation Levels
- No consistency guarantee on reads/writes outside of a microshard
[Figure: transaction groups (each a set of transactions T) distributed on the query execution nodes, operating on microshards distributed on the key-value store]

Scale Independence
Experiment setup
- RUBiS benchmark (eBay-type auction application)
- Read/write workload (transition matrix)
- Short think time to saturate the system
- Voldemort (Dynamo) key-value store
[Figure: throughput (1000 sessions / sec) versus number of emulated concurrent clients (thousands), with one curve each for 3, 4, 5, and 6 Voldemort nodes]
Message: ability to automatically scale to more concurrent sessions (throughput) simply by increasing the number of key-value nodes

Directions/Questions
- Support for specifying relaxed consistency
  - Tooling to relax consistency just to the degree that there exists a feasible solution (physical design and query plans) for the specification
- Scalable data organization over heterogeneous data stores
  - Physical design over heterogeneous stores such that the service level specifications are met
- Scalability vs. consistency

The Cast
NEC Labs Researchers
- Hakan Hacigumus
- Yun Chi
- Wang-Pin Hsiung
- Hojjat Jafarpour
- Hyun J. Moon
- Oliver Po
- Junichi Tatemura
- Jagan Sankaranarayanan
Advisors/Collaborators
- Michael Carey (U. of California, Irvine)
- Hector Garcia-Molina (Stanford)
- Jeff Naughton (U. of Wisconsin, Madison)

CloudDB would be
A unified data management platform that provides capabilities to transparently and efficiently support heterogeneous workloads by leveraging specialized storage models with SLA-conscious profit optimization in the cloud.

Thank You!