Daniel J. Abadi
Workshop presentation by Lukas Probst
3 characteristics of a cloud computing environment:
1. Compute power is elastic, but only if the workload is parallelizable
2. Data is stored at an untrusted host
3. Data is replicated, often across large geographic distances
14.12.2012
[Figure: Cloud]
- Reads are easily parallelizable
- Writes need propagation
- Shared-nothing architecture
Encryption is necessary
Transactional data management (OLTP)
- Relies on the ACID guarantees that the database provides
- Write-intensive
Analytical data management (OLAP)
- The scale of OLAP systems is generally larger than that of OLTP systems
- Read-mostly (or read-only) with occasional batch inserts
Check whether OLTP applications are likely to be deployed in the cloud
- None of the 4 big players has a shared-nothing transactional database
- Non-trivial to implement one:
  - Data is partitioned across sites
  - Transactions cannot be restricted to accessing data from a single site
- Main benefit (scalability) is less relevant
CAP theorem: choose at most two out of three properties
- Consistency vs. availability
- The C part of ACID is typically compromised to yield reasonable system availability
- OLTP DBs contain the complete set of operational data needed to power mission-critical business processes
- Data includes detail at the lowest granularity and sensitive information
- Untrusted hosts are unacceptable
OLTP applications are not well-suited for cloud deployment
Transactional data management (OLTP)
- Relies on the ACID guarantees that the database provides
- Write-intensive
Analytical data management (OLAP)
- The scale of OLAP systems is generally larger than that of OLTP systems
- Read-mostly (or read-only) with occasional batch inserts
Check whether OLAP applications are likely to be deployed in the cloud
- Scalability is very important
- Shared-nothing architecture scales the best
- Data analysis workloads are easy to parallelize across nodes in a shared-nothing network
- Only infrequent writes
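As an illustration of why analysis workloads parallelize well in a shared-nothing setting, here is a minimal sketch. It assumes a toy sales table held in memory and uses threads to stand in for nodes; the names `partition`, `local_sum`, `parallel_sum`, and `SALES` are hypothetical and not taken from the paper or from any product.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_nodes):
    """Split rows into disjoint slices, one per node (nothing is shared)."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row["id"] % num_nodes].append(row)
    return nodes


def local_sum(node_rows):
    """Each node aggregates only the rows it owns."""
    totals = defaultdict(int)
    for row in node_rows:
        totals[row["region"]] += row["amount"]
    return totals


def parallel_sum(rows, num_nodes=3):
    """Run the local aggregations in parallel, then merge the partial results."""
    nodes = partition(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = defaultdict(int)
    for partial in partials:
        for region, total in partial.items():
            merged[region] += total
    return dict(merged)


# Hypothetical sample data for illustration only.
SALES = [
    {"id": 1, "region": "EU", "amount": 10},
    {"id": 2, "region": "US", "amount": 20},
    {"id": 3, "region": "EU", "amount": 5},
    {"id": 4, "region": "US", "amount": 7},
    {"id": 5, "region": "ASIA", "amount": 3},
]

if __name__ == "__main__":
    print(parallel_sum(SALES))
```

The only coordination is the final merge of small partial results, which is why adding nodes helps read-mostly aggregation queries.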
A, C and I are easy to obtain
- Only infrequent writes
- Sufficient to perform the analysis on a recent snapshot
Consistency tradeoffs are not problematic for analytical databases
4 possibilities to handle sensitive data for analysis:
1. Leave them out of the analytical data store
2. Include them after anonymization
3. Include them after encryption
4. Analyze only less granular versions of the data
Untrusted hosts can be used for analysis
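Three of the four possibilities can be sketched in a few lines. The record layout, the field names, `SECRET_SALT`, and the functions `pseudonymize` and `prepare_for_cloud` are all hypothetical. A salted hash is pseudonymization, which is weaker than full anonymization, and option 3 (encryption) is left out because the Python standard library ships no cipher.

```python
import hashlib

# Hypothetical secret kept on the trusted side; never uploaded.
SECRET_SALT = b"hypothetical-salt"


def pseudonymize(value: str) -> str:
    """Replace an identifier with a salted hash (option 2, pseudonymization only)."""
    digest = hashlib.sha256(SECRET_SALT + value.encode("utf-8")).hexdigest()
    return digest[:16]


def prepare_for_cloud(record: dict) -> dict:
    """Build the version of a record that may leave the trusted environment."""
    return {
        # Option 1: the credit card number is simply not copied.
        # Option 2: the customer name is replaced by a pseudonym.
        "customer": pseudonymize(record["customer"]),
        # Option 4: keep only the month, a less granular version of the date.
        "order_month": record["order_date"][:7],
        "amount": record["amount"],
    }


# Hypothetical sample record for illustration only.
RECORD = {
    "customer": "Alice Example",
    "credit_card": "0000-0000-0000-0000",
    "order_date": "2012-12-14",
    "amount": 42.5,
}

if __name__ == "__main__":
    print(prepare_for_cloud(RECORD))
```

Because the pseudonym is deterministic, per-customer aggregates still work on the untrusted host without exposing the name.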
OLAP applications are well-suited for cloud deployment
Concentrate on data analysis (OLAP)
Cloud DBMS wish list
Check how close two currently available solutions come to attaining these properties:
- MapReduce-like software (e.g., Hadoop)
- Commercially available shared-nothing parallel databases
1. Efficiency
2. Fault tolerance
3. Ability to run in a heterogeneous environment
4. Ability to interface with business intelligence products (visualization, query generation, ...)
5. Ability to operate on encrypted data
MapReduce
- MapReduce is much slower than alternative systems
- Was not designed for complete, end-to-end data analysis systems over structured data
- For the business-oriented data analysis market, MapReduce can be wildly inefficient
Shared-nothing parallel DBs
- In structured data, queries tend to access only a subset of the data
- Use helper structures which accelerate the access
- The use of helper structures outperforms MapReduce's brute-force strategy
- The one-time cost of their creation is outweighed by the benefit each time they are used
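The difference between a brute-force scan and a helper structure can be sketched as follows. This is a simplified in-memory illustration, not how any particular database implements indexes; `scan`, `build_index`, `lookup`, and `ROWS` are hypothetical names.

```python
from collections import defaultdict


def scan(rows, key, value):
    """Brute-force strategy: touch every row for every query."""
    return [row for row in rows if row[key] == value]


def build_index(rows, key):
    """One-time cost: map each key value to the positions of matching rows."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[key]].append(pos)
    return index


def lookup(rows, index, value):
    """Per-query benefit: touch only the rows that match."""
    return [rows[pos] for pos in index.get(value, [])]


# Hypothetical sample data for illustration only.
ROWS = [{"id": i, "country": "CH" if i % 100 == 0 else "XX"} for i in range(1000)]

if __name__ == "__main__":
    country_index = build_index(ROWS, "country")  # paid once
    print(len(scan(ROWS, "country", "CH")))       # reads 1000 rows
    print(len(lookup(ROWS, country_index, "CH"))) # reads only the matches
```

The scan reads all rows on every query, while the index is built once and then reused, which is the trade-off the slide describes.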
MapReduce
- Designed with fault tolerance as a high priority
[Figure: Map phase with splits 0 to n; the master assigns split 0 to a worker and reassigns it to another worker, which reads split 0 again]
Shared-nothing parallel DBs
- Most parallel database systems restart a query upon a failure
- Designed to run in environments where failures are relatively rare
- This is not the case for clouds
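The reassignment idea in the figure can be sketched as a small single-process simulation. It assumes workers are plain callables that raise `RuntimeError` when they are lost; `run_map_phase`, `healthy_worker`, and `make_crashing_worker` are hypothetical names, and a real MapReduce master detects failures through heartbeats rather than exceptions.

```python
def healthy_worker(map_fn, split):
    """A worker that always completes its task."""
    return map_fn(split)


def make_crashing_worker(crashes):
    """Build a worker that fails its first `crashes` tasks, then recovers."""
    state = {"left": crashes}

    def worker(map_fn, split):
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("worker lost")
        return map_fn(split)

    return worker


def run_map_phase(splits, workers, map_fn, max_attempts=None):
    """Master loop: assign each split, and reassign it if its worker fails."""
    if max_attempts is None:
        max_attempts = 4 * len(splits)
    results = {}
    pending = list(range(len(splits)))
    attempt = 0
    while pending:
        if attempt >= max_attempts:
            raise RuntimeError("too many failed attempts")
        split_id = pending.pop(0)
        worker = workers[attempt % len(workers)]  # simple round-robin choice
        attempt += 1
        try:
            results[split_id] = worker(map_fn, splits[split_id])
        except RuntimeError:
            # Only the failed split is redone; finished splits are kept.
            pending.append(split_id)
    return [results[i] for i in range(len(splits))]


if __name__ == "__main__":
    workers = [make_crashing_worker(1), healthy_worker]
    print(run_map_phase([[1, 2], [3, 4], [5, 6]], workers, sum))
```

By contrast, a system that restarts the whole query on failure would discard the results of splits that had already finished.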
MapReduce
- Designed to run in a heterogeneous environment
[Figure: Map phase with splits 0 to n; the master assigns split 0 to a slow worker and reassigns it to another worker if it is still in progress when nearly all other workers have finished]
Shared-nothing parallel DBs
- Generally designed to run on homogeneous equipment
- Performance can degrade significantly if a small subset of nodes in the parallel cluster performs particularly poorly
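The effect of re-running a straggler's split on an idle worker can be estimated with a toy model. It assumes one equal-sized split per worker and a single backup task started by the first worker to finish; `job_time` and its parameters are hypothetical and greatly simplify what real schedulers do.

```python
def job_time(work, speeds, speculate):
    """Return the completion time of a map phase with one split per worker.

    work      -- amount of work in each split (arbitrary units)
    speeds    -- work units per second for each worker
    speculate -- if True, the first idle worker re-runs the slowest split
    """
    finish = [work / speed for speed in speeds]
    if not speculate:
        # The whole job waits for the slowest worker.
        return max(finish)
    straggler = max(range(len(speeds)), key=lambda i: finish[i])
    helper = min(range(len(speeds)), key=lambda i: finish[i])
    # The helper starts the backup copy once its own split is done.
    backup_finish = finish[helper] + work / speeds[helper]
    done = list(finish)
    # The split counts as complete when either copy finishes.
    done[straggler] = min(finish[straggler], backup_finish)
    return max(done)


if __name__ == "__main__":
    speeds = [10, 10, 1]  # the third worker is ten times slower
    print(job_time(100, speeds, speculate=False))
    print(job_time(100, speeds, speculate=True))
```

In this model a single slow node dominates the run time without the backup task, which mirrors the degradation described for homogeneous-cluster designs.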
MapReduce
- MapReduce is not intended to be a database system
- Not SQL compliant
- Does not easily interface with existing business intelligence products
Shared-nothing parallel DBs
- Comes for free
MapReduce
- No native ability
- Ability would have to be provided using user-defined code
Shared-nothing parallel DBs
- Have not implemented the recent research results
- Only in some cases are simple operations (moving or copying encrypted data) supported
- Advanced operations are only possible through user-defined functions
Property: MapReduce / Shared-nothing parallel DBs
1. Efficiency: weak / strong
2. Fault tolerance: strong / weak
3. Ability to run in a heterogeneous environment: strong / weak
4. Ability to interface with business intelligence products: weak / strong
5. Ability to operate on encrypted data: weak / weak
A hybrid solution could have a significant impact on the cloud database market
Recent work focuses mainly on language and interface issues:
- Integrating declarative query constructs into MapReduce-like software
- Ability to write MapReduce functions over data stored in parallel database products
But: there remains a need for a hybrid solution at the systems level
1. How to combine the ease-of-use, out-of-the-box advantages of MapReduce with the efficiency and shared-work advantages that come with loading data and creating performance-enhancing data structures?
2. How to balance the tradeoffs between fault tolerance and performance?
Problem: checkpointing intermediate results usually comes at a performance cost
The paper answers the questions:
- What can we do in the cloud?
- What solution do we want for that?
But:
- How can we use the cloud today for data warehousing?
- Are there any useful products today?
- How can we implement the hybrid solution?
The nodes can be, for example, Amazon EC2 instances, but where can we store the data?
Is there any existing shared-nothing parallel data warehouse product in any cloud that we can use? And if yes, how can we put our data in the cloud?
Proposed solutions for the two open research questions:
- Incremental algorithms: data can initially be read directly off the file system out of the box, but each time data is accessed, progress is made towards the many activities surrounding a DBMS load
- A system that can adjust its levels of fault tolerance on the fly given an observed failure rate
Sounds nice, but are there any sophisticated concepts implemented or at least presented yet?
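One way to picture adjusting fault tolerance to an observed failure rate is a toy cost model. This is not a method from the paper: it assumes a query of equal-length stages, an independent failure probability per stage, and a fixed cost per checkpoint. All function names and parameters are hypothetical.

```python
def expected_cost_restart(stages, stage_time, p_fail):
    """Expected run time when any failure restarts the whole query."""
    if p_fail == 0:
        return stages * stage_time
    q = 1.0 - p_fail
    # Expected number of stage executions until `stages` succeed in a row.
    return stage_time * (q ** -stages - 1.0) / p_fail


def expected_cost_checkpoint(stages, stage_time, checkpoint_time, p_fail):
    """Expected run time when each stage is checkpointed and only a failed stage is redone."""
    return stages * (stage_time + checkpoint_time) / (1.0 - p_fail)


def choose_policy(stages, stage_time, checkpoint_time, failures_seen, stages_seen):
    """Pick the cheaper policy for the failure rate observed so far."""
    p_fail = failures_seen / stages_seen if stages_seen else 0.0
    p_fail = min(p_fail, 0.99)  # clamp to keep the toy formulas finite
    restart = expected_cost_restart(stages, stage_time, p_fail)
    checkpoint = expected_cost_checkpoint(stages, stage_time, checkpoint_time, p_fail)
    return "checkpoint" if checkpoint < restart else "restart"


if __name__ == "__main__":
    # 10 stages of 1.0 s each, 0.1 s per checkpoint (made-up numbers).
    print(choose_policy(10, 1.0, 0.1, failures_seen=1, stages_seen=1000))
    print(choose_policy(10, 1.0, 0.1, failures_seen=200, stages_seen=1000))
```

With rare failures the checkpoint overhead is not worth paying, while with frequent failures restarting from scratch becomes far more expensive, so the choice flips as the observed rate changes.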