Daniel J. Abadi. Workshop presentation by Lukas Probst

3 characteristics of a cloud computing environment:
1. Compute power is elastic, but only if workload is parallelizable
2. Data is stored at an untrusted host
3. Data is replicated, often across large geographic distances
14.12.2012


- Reads are easily parallelizable
- Writes need propagation
- Shared-nothing architecture

Encryption is necessary


Transactional data management (OLTP)
- Relies on the ACID guarantees that the database provides
- Write-intensive
Analytical data management (OLAP)
- The scale of OLAP systems is generally larger than that of OLTP systems
- Read-mostly (or read-only) with occasional batch inserts
Check whether OLTP applications are likely to be deployed in the cloud

- None of the 4 big players has a shared-nothing transactional database
- It is non-trivial to implement one
  - Data is partitioned across sites
  - Transactions cannot be restricted to accessing data from a single site
- The main benefit (scalability) is less relevant

- CAP theorem: choose at most two out of three properties (consistency, availability, partition tolerance)
- Consistency vs. availability
- The C part of ACID is typically compromised to yield reasonable system availability

- OLTP DBs contain the complete set of operational data needed to power mission-critical business processes
- The data includes
  - detail at the lowest granularity
  - sensitive information
- Untrusted hosts are unacceptable

OLTP applications are not well suited for cloud deployment

Transactional data management (OLTP)
- Relies on the ACID guarantees that the database provides
- Write-intensive
Analytical data management (OLAP)
- The scale of OLAP systems is generally larger than that of OLTP systems
- Read-mostly (or read-only) with occasional batch inserts
Check whether OLAP applications are likely to be deployed in the cloud

- Scalability is very important
- Shared-nothing architecture scales the best
- Data analysis workloads are easy to parallelize across nodes in a shared-nothing network
- Only infrequent writes
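
As a rough illustration of why analytical workloads parallelize well on a shared-nothing system, the sketch below spreads rows across a few simulated nodes, lets each node aggregate only its own partition, and merges the small partial results. The function names, the round-robin partitioning and the sample data are assumptions made for this example; the presentation itself contains no code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, num_nodes):
    """Round-robin the rows so that each simulated node owns a disjoint subset."""
    partitions = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        partitions[i % num_nodes].append(row)
    return partitions


def local_aggregate(partition):
    """Each node sums revenue per region using only its own data."""
    totals = Counter()
    for region, revenue in partition:
        totals[region] += revenue
    return totals


def merge_partials(partials):
    """A coordinator adds up the small per-node results."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds the counts together
    return dict(merged)


def parallel_revenue_by_region(rows, num_nodes=3):
    partitions = partition_rows(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_aggregate, partitions))
    return merge_partials(partials)


# Hypothetical sample data: (region, revenue)
SALES = [("EU", 10), ("US", 20), ("EU", 5), ("ASIA", 7), ("US", 1)]
```

No node needs another node's rows to compute its partial result, and only the per-region totals travel to the coordinator, which is what makes the read-mostly case scale.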

- A, C and I (of ACID) are easy to obtain
  - Only infrequent writes
  - It is sufficient to perform the analysis on a recent snapshot
- Consistency tradeoffs are not problematic for analytical databases

4 possibilities to handle sensitive data for analysis:
1. Leave them out of the analytical data store
2. Include them after anonymization
3. Include them after encryption
4. Analyze only less granular versions of the data
Untrusted hosts can be used for analysis
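
A minimal sketch of how options 1, 2 and 4 might look before data leaves the trusted site. The record layout, the field names and the helper functions are hypothetical. A salted hash is strictly speaking pseudonymization, a weaker stand-in for real anonymization, and option 3 (encryption) is not shown.

```python
import hashlib
from collections import defaultdict


def pseudonymize(value, salt):
    """Replace an identifier with a salted SHA-256 digest (stand-in for option 2)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def prepare_for_cloud(records, salt):
    """Build the records that are allowed to leave the trusted site."""
    return [
        {
            # Option 1: the "name" field is simply not copied.
            "customer": pseudonymize(r["customer_id"], salt),  # option 2
            "month": r["date"][:7],  # option 4: keep the month, drop the day
            "amount": r["amount"],
        }
        for r in records
    ]


def monthly_totals(prepared):
    """An analysis that still works on the reduced data."""
    totals = defaultdict(int)
    for r in prepared:
        totals[r["month"]] += r["amount"]
    return dict(totals)


# Hypothetical operational records
RECORDS = [
    {"customer_id": "c-17", "name": "Alice", "date": "2012-12-03", "amount": 40},
    {"customer_id": "c-17", "name": "Alice", "date": "2012-12-14", "amount": 60},
    {"customer_id": "c-42", "name": "Bob", "date": "2012-11-30", "amount": 25},
]
```

The same customer still maps to the same token, so per-customer analysis remains possible without exposing the identifier itself.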

OLAP applications are well-suited for cloud deployment
Concentrate on data analysis (OLAP)

- Cloud DBMS wish list
- Check how close two currently available solutions come to attaining these properties
  - MapReduce-like software (e.g., Hadoop)
  - Commercially available shared-nothing parallel databases

1. Efficiency
2. Fault tolerance
3. Ability to run in a heterogeneous environment
4. Ability to interface with business intelligence products (visualization, query generation, ...)
5. Ability to operate on encrypted data

Property 1: Efficiency
MapReduce
- MapReduce is much slower than alternative systems
- Was not designed for complete, end-to-end data analysis systems over structured data
- With structured data, queries tend to access only a subset of the data
- For the business-oriented data analysis market, MapReduce can be wildly inefficient
Shared-nothing parallel DBs
- Use helper structures which accelerate the access
- The use of helper structures outperforms MapReduce's brute-force strategy
- The one-time cost of their creation is outweighed by the benefit each time they are used
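
To make the helper-structure argument concrete, here is a small sketch that compares a brute-force scan with a sorted index for a range count. The class and function names and the synthetic rows are assumptions for illustration; real parallel databases use far richer structures (B-trees, materialized views and so on).

```python
from bisect import bisect_left, bisect_right


def scan_count(rows, lo, hi):
    """Brute force: inspect every row, as a plain scan over the input would."""
    return sum(1 for key, _ in rows if lo <= key <= hi)


class SortedIndex:
    """Helper structure: built once, reused by every later query."""

    def __init__(self, rows):
        # One-time cost: O(n log n) to sort the keys.
        self.keys = sorted(key for key, _ in rows)

    def count(self, lo, hi):
        # Two binary searches instead of a full scan: O(log n) per query.
        return bisect_right(self.keys, hi) - bisect_left(self.keys, lo)


# Hypothetical rows: (key, payload)
ROWS = [(k * 7 % 101, "payload") for k in range(1000)]
INDEX = SortedIndex(ROWS)
```

Both approaches return the same answer; the index pays its sorting cost once and then answers each query without touching most of the data, which is the effect the slide describes.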

Property 2: Fault tolerance
MapReduce
- Designed with fault tolerance as a high priority
Shared-nothing parallel DBs
- Most parallel database systems restart the query upon a failure
- Designed to run in environments where failures are relatively rare
- This is not the case for clouds
[Figure: Map phase. Input splits 0 to n; the master assigns split 0 to a worker and later reassigns it to another worker, which reads split 0 again.]
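
The reassignment idea can be sketched as a toy master loop. Everything here is simulated in one process: the worker names, the `fails` set that injects crashes, and the function name are hypothetical, and this is not Hadoop's actual scheduler.

```python
def run_map_phase(splits, workers, map_fn, fails):
    """Assign each split to a worker; if that worker fails, reassign the split.

    `fails` is a set of (worker, split_id) pairs that simulate crashes.
    Returns the map results and a log of every assignment the master made.
    """
    results = {}
    log = []
    for split_id, data in enumerate(splits):
        for worker in workers:  # try the workers in order
            log.append((split_id, worker))
            if (worker, split_id) in fails:
                continue  # worker died: only this split is handed to the next one
            results[split_id] = map_fn(data)
            break
        else:
            raise RuntimeError(f"no worker could process split {split_id}")
    return results, log


# Hypothetical input: three splits, two workers, w1 crashes on split 0
SPLITS = [[1, 2], [3], [4, 5, 6]]
RESULTS, LOG = run_map_phase(SPLITS, ["w1", "w2"], sum, {("w1", 0)})
```

Only split 0 is processed twice; the other splits are unaffected. A system that restarts the whole query on failure would redo all three.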

Property 3: Ability to run in a heterogeneous environment
MapReduce
- Designed to run in a heterogeneous environment
Shared-nothing parallel DBs
- Generally designed to run on homogeneous equipment
- Performance can degrade significantly if a small subset of nodes in the parallel cluster is performing particularly poorly
[Figure: Map phase. Input splits 0 to n; the master assigns split 0 to a slow worker and reassigns split 0 to another worker if it is still in progress when nearly all other workers have already finished.]
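
A back-of-the-envelope model of such backup tasks, assuming one equally sized split per worker and that the fastest worker is idle once all but the slowest task have finished. The function and its parameters are illustrative assumptions, not a description of any real scheduler.

```python
def map_phase_time(work_per_split, worker_speeds, speculate):
    """Simulated finish time of a map phase with one split per worker.

    work_per_split: units of work in each split (all splits equal).
    worker_speeds: units of work per second for each worker.
    speculate: whether the master launches a backup copy of the last task.
    """
    finish_times = sorted(work_per_split / s for s in worker_speeds)
    slowest = finish_times[-1]
    if not speculate or len(finish_times) < 2:
        return slowest
    # Once every other task is done, start a backup copy of the remaining
    # split on the fastest worker; whichever copy finishes first wins.
    backup_start = finish_times[-2]
    backup_finish = backup_start + work_per_split / max(worker_speeds)
    return min(slowest, backup_finish)
```

With three workers at speed 10 and one at speed 1, and 100 units of work per split, the phase takes 100 seconds without the backup task and 20 seconds with it. On a homogeneous cluster the backup copy never wins, so nothing changes.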

Property 4: Ability to interface with business intelligence products
MapReduce
- MapReduce is not intended to be a database system
- Not SQL compliant
- Does not easily interface with existing business intelligence products
Shared-nothing parallel DBs
- Comes for free

Property 5: Ability to operate on encrypted data
MapReduce
- No native ability
- The ability would have to be provided using user-defined code
Shared-nothing parallel DBs
- Recent research results have not been implemented
  - Only in some cases are simple operations (moving or copying encrypted data) supported
- Advanced operations are only possible through user-defined functions

Properties compared for MapReduce and shared-nothing parallel DBs:
1. Efficiency
2. Fault tolerance
3. Ability to run in a heterogeneous environment
4. Ability to interface with business intelligence products
5. Ability to operate on encrypted data
A hybrid solution could have a significant impact on the cloud database market

Recent work focuses mainly on language and interface issues:
- Integrating declarative query constructs into MapReduce-like software
- Ability to write MapReduce functions over data stored in parallel database products
But: there remains a need for a hybrid solution at the systems level

1. How to combine the ease-of-use, out-of-the-box advantages of MapReduce with the efficiency and shared-work advantages that come with loading data and creating performance-enhancing data structures?
2. How to balance the tradeoffs between fault tolerance and performance?
   Problem: checkpointing intermediate results usually comes at a performance cost
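
One way to reason about the second question is a first-order cost model. The sketch below uses Young's classic approximation for the checkpoint interval, a standard result from the checkpointing literature that the slides do not mention; it is swapped in here only to make the tradeoff tangible. The model assumes failures are rare relative to the interval, and all function and parameter names are hypothetical.

```python
import math


def observed_failure_rate(failures, elapsed_seconds):
    """Failures per second, estimated from what has been observed so far."""
    return failures / elapsed_seconds


def overhead_fraction(interval, checkpoint_cost, failure_rate):
    """Approximate fraction of time lost to checkpoints plus redone work.

    checkpoint_cost / interval: time spent writing checkpoints.
    failure_rate * interval / 2: on average half an interval is redone per failure.
    """
    return checkpoint_cost / interval + failure_rate * interval / 2


def young_interval(checkpoint_cost, failure_rate):
    """Interval that minimizes overhead_fraction (Young's approximation)."""
    return math.sqrt(2 * checkpoint_cost / failure_rate)
```

Under this model, a 2-second checkpoint and 3 failures observed in 30000 seconds give an interval of about 200 seconds and roughly 2% overhead. Checkpointing twice as often or half as often both cost more, and a higher observed failure rate shortens the interval.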

The paper answers the questions:
- What can we do in the cloud?
- What solution do we want for that?
But:
- How can we use the cloud today for data warehousing?
- Are there any useful products today?
- How can we implement the hybrid solution?

The nodes can be, for example, Amazon EC2 instances, but where can we store the data?

Is there an existing shared-nothing parallel data warehouse product in any cloud that we can use? If so, how can we get our data into the cloud?

Proposed solutions for the two open research questions:
- Incremental algorithms
  - Data can initially be read directly off the file system out of the box, but each time data is accessed, progress is made towards the many activities surrounding a DBMS load
- A system that can adjust its level of fault tolerance on the fly given an observed failure rate
Sounds nice, but have any sophisticated concepts been implemented, or at least presented, yet?
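
The incremental-loading idea can be illustrated with a toy store that answers lookups straight from raw lines and keeps whatever it had to parse along the way. The class name, the `key,value` line format and the counters are assumptions made for this sketch; it is not an implementation from the paper.

```python
class IncrementalStore:
    """Usable immediately on raw data; every lookup leaves more of it loaded."""

    def __init__(self, raw_lines):
        self.raw_lines = raw_lines  # stands in for a file read "out of the box"
        self.parsed_upto = 0        # number of raw lines loaded so far
        self.index = {}             # key -> value for the loaded lines
        self.lines_parsed = 0       # total parsing work done, for illustration

    def lookup(self, key):
        if key in self.index:       # already loaded: no parsing needed
            return self.index[key]
        # Continue parsing where the previous query stopped and keep the results.
        while self.parsed_upto < len(self.raw_lines):
            k, v = self.raw_lines[self.parsed_upto].split(",")
            self.parsed_upto += 1
            self.lines_parsed += 1
            self.index[k] = v
            if k == key:
                return v
        return None                 # key is not in the data
```

The first lookup pays for parsing the lines it passes over. Later lookups for keys already seen cost a dictionary access, and no line is ever parsed twice, so the load work is spread across queries rather than paid up front.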
