Data Management in the Cloud: Limitations and Opportunities. Annies Ductan




Discussion Outline:
- Introduction
  - Overview
  - Vision of Cloud Computing
- Managing Data in the Cloud
  - Cloud Characteristics
  - Data Management Applications
    - Transactional
    - Analytical
- Analyzing Data in the Cloud
  - Ideal Properties and Features
  - MapReduce-like Software
  - Shared-Nothing Parallel Databases
  - Hybrid Solution
- Summary

Overview:
There is a need for research and engineering work to be done in creating a hybrid MapReduce/parallel database system.
Three main topics discussed:
- Features a DBMS should implement for large-scale data analysis
- Options available among open source and commercial database systems
- The need for a new DBMS design specifically for cloud environments

Vision:
- Present: Moving computer processing, storage, and software delivery away from the desktop and local servers
- Future: Into next-generation data centers hosted by large infrastructure companies such as:
  - Amazon
  - Google
  - Yahoo
  - Microsoft
  - Sun
- Benefits:
  - Corporations are freed from large IT capital investments
  - Corporations gain access to extremely powerful computing resources over the network

Vision:
- Advantages:
  - Ability for elastic computing: resources are available for growth spikes
  - Low cost: pay only for what is used
  - Example: Amazon's EC2 offers computing resources as virtual private server instances in several sizes:
    - Small
    - Large
    - Extra-large
- Disadvantages:
  - Privacy violations are possible
  - The hosting company has the ability to access customers' data without permission

Managing Data In the Cloud: Amazon EC2 Instance Types
Available instance types:
- Small Instance (default), API name: m1.small
  - 1.7 GB memory
  - 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
  - 160 GB instance storage
  - 32-bit or 64-bit platform
  - I/O performance: Moderate
- Medium Instance, API name: m1.medium
  - 3.75 GB memory
  - 2 EC2 Compute Units (1 virtual core with 2 EC2 Compute Units)
  - 410 GB instance storage
  - 32-bit or 64-bit platform
  - I/O performance: Moderate
- Large Instance, API name: m1.large
  - 7.5 GB memory
  - 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
  - 850 GB instance storage
  - 64-bit platform
  - I/O performance: High
- Extra Large Instance, API name: m1.xlarge
  - 15 GB memory
  - 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
  - 1,690 GB instance storage
  - 64-bit platform
  - I/O performance: High

Managing Data In the Cloud: Three characteristics of the cloud:
- Computing power is elastic, but only if the workload is parallelizable:
  - Additional resources are obtained by allocating additional server instances to a task
  - Suitable for applications designed to run on a shared-nothing architecture
  - Only useful if the application can take advantage of the additional server instances by offloading some of its work to the new instances, which run in parallel with the old ones
- Data is stored at an untrusted host:
  - The US Patriot Act allows the government to demand access to the data stored on any computer
  - Most cloud computing vendors give the customer little control over where data is stored
  - Example: Amazon S3 only allows a customer to choose between US and EU data storage
  - Customers have little choice but to assume the data may be accessed, unless it is encrypted using a key not located at the host
- Data is replicated across large geographic distances:
  - Replication across large geographic distances allows cloud computing providers to offer high levels of fault tolerance
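The point that extra instances only help a parallelizable workload can be made concrete with a small calculation. The sketch below uses Amdahl's law, which is brought in here as an illustration and is not named on the slides; the parallel fractions and instance counts are made-up inputs.

```python
def speedup(parallel_fraction: float, instances: int) -> float:
    """Amdahl's-law speedup when `parallel_fraction` of the work can be
    spread evenly over `instances` identical server instances."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / instances)


# Hypothetical workloads: half parallel, mostly parallel, fully parallel.
for fraction in (0.5, 0.95, 1.0):
    row = [round(speedup(fraction, n), 2) for n in (1, 10, 100)]
    print(f"parallel fraction {fraction}: speedup on 1/10/100 instances = {row}")
```

Under this model a workload that is only half parallelizable never reaches a 2x speedup, no matter how many instances are allocated.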

Managing Data In the Cloud: Data Management Market
Composed of two large components:
- Transactional Data Management: databases that back:
  - Banking
  - Airline reservation
  - Online e-commerce
  - Supply chain management
- Analytical Data Management: applications querying a data store for:
  - Business planning
  - Problem solving
  - Decision support

Managing Data In the Cloud: Transactional vs. Analytical
- Characteristics of Transactional Data Management:
  - Shared-nothing architecture: typically not used
  - ACID guarantees: hard to maintain
  - Security: risk involved with an untrusted host
- Characteristics of Analytical Data Management:
  - Shared-nothing architecture: good match
  - ACID guarantees: typically not needed
  - Security: sensitive data not used in analysis

Managing Data In the Cloud: Characteristics of Transactional Data Management:
- Rely on the ACID guarantees of databases
- Tend to be write-intensive
- Not likely to be stored in the cloud:
  - Do not typically use a shared-nothing architecture
  - ACID guarantees are difficult to maintain when data is replicated over large geographic distances
  - Storing transactional data at an untrusted host carries risks

Managing Data In the Cloud: Transactional Data Management: Shared-nothing architecture
- Shared-nothing architecture is known for scalability, but this is not relevant to transactional data management:
  - The majority of deployments are less than 1 TB in size
  - Data is partitioned across sites
  - Transactions cannot be restricted to accessing data from a single site
- What this means for cloud computing:
  - Overhead from complex distributed locking and commit protocols, and from shipping data over a network
  - Increased latency and potential network bandwidth bottlenecks
- Transactional database providers:
  - Oracle: implemented the Oracle Database Machine, which uses a shared-nothing architecture at the storage layer and is designed only to be used for data warehouses
  - IBM DB2: created the Database Partitioning Feature (DPF), an add-on to its flagship product, designed to scale analytical applications running on data warehouses
  - Microsoft SQL Server: no shared-nothing architecture
  - Sybase: no shared-nothing architecture

Managing Data In the Cloud: Transactional Data Management: Maintaining ACID guarantees
- Distributed computer systems can provide only two of the following three properties:
  - Consistency
  - Availability
  - Partition tolerance
- When data is replicated over a wide area, partition tolerance is effectively required, which leaves a choice between consistency and availability
- In practice, the consistency part of ACID is compromised to yield reasonable system availability

Managing Data In the Cloud: Transactional Data Management: Security
- Type of information held in transactional data management (TDM) systems:
  - Operational data needed to power mission-critical business processes, including sensitive data such as:
    - Customer data
    - Credit card numbers
- Because of this, the conclusion is that TDM applications are not well suited for cloud deployment
- Even so, some companies offer TDM products that run in Amazon's cloud:
  - EnterpriseDB's Postgres Plus Advanced Server
  - Oracle

Managing Data In the Cloud: Characteristics of Analytical Data Management:
- Shared-nothing architecture is a good match for analytical data management
- ACID guarantees are typically not needed
- Sensitive data can often be left out of the analysis

Managing Data In the Cloud: Analytical Data Management: Shared-nothing architecture
- The primary driver is the increasing amount of data involved in data analysis workloads, which consist of:
  - Large scans
  - Multidimensional aggregations
  - Star schema joins
- These operations are easy to parallelize across nodes in a shared-nothing network
- Infrequent writes eliminate the need for complex distributed locking and commit protocols
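To illustrate why scans and aggregations parallelize well in a shared-nothing setting, here is a minimal sketch: each "node" aggregates only its own partition, and a final step merges the partial results. The function names, the round-robin partitioning, and the use of threads to stand in for separate machines are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_nodes):
    """Round-robin rows into disjoint partitions, one per (simulated) node."""
    parts = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return parts


def local_sum(part):
    """Aggregate one partition without looking at any other node's data."""
    totals = Counter()
    for region, amount in part:
        totals[region] += amount
    return totals


def parallel_sum(rows, n_nodes=4):
    """Scan all partitions in parallel, then merge the partial aggregates."""
    parts = partition(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_sum, parts))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds the per-region partial sums
    return dict(merged)


sales = [("EU", 10), ("US", 5), ("EU", 7), ("US", 1)]
print(parallel_sum(sales, n_nodes=2))
```

Because the workload is read-only, the nodes never need to coordinate until the final merge, which is the property the slide attributes to infrequent writes.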

Managing Data In the Cloud: Analytical Data Management: ACID guarantees
The atomicity, consistency, and isolation parts of ACID are easy to obtain because:
- Writes are infrequent in analytical database workloads
- Analysis can be performed on a recent snapshot of the data rather than on the most recent data

Managing Data In the Cloud: Analytical Data Management: Security
- Sensitive data can be left out of the analysis
- Less granular data can be analyzed instead of the lowest level of data
- Three options for highly sensitive data:
  - Leave it out of the analytical data store
  - Include it after applying an anonymization function
  - Include it after encrypting it
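Two of the options above, applying an anonymization function and analyzing less granular data, can be sketched together. The example below is only one possible approach and is an assumption of this sketch, not something the slides prescribe: customer identifiers are replaced with a keyed hash (strictly speaking, pseudonymization), and individual transactions are rolled up to monthly totals before they reach the analytical store. The function names and record layout are hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 hash. The key stays with
    the data owner and is never stored at the cloud host."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def monthly_totals(records, secret_key: bytes):
    """Roll per-transaction rows (customer_id, 'YYYY-MM-DD', amount) up to
    one total per pseudonymous customer and month."""
    totals = defaultdict(float)
    for customer_id, date, amount in records:
        month = date[:7]  # keep only 'YYYY-MM'
        totals[(pseudonymize(customer_id, secret_key), month)] += amount
    return dict(totals)


records = [
    ("alice", "2009-03-01", 10.0),
    ("alice", "2009-03-15", 5.0),
    ("bob", "2009-03-02", 1.0),
]
for (customer, month), total in monthly_totals(records, b"local-secret").items():
    print(customer[:8], month, total)
```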

Data Analysis in the Cloud: Ideal properties and features in a cloud DBMS:
- Efficiency
- Fault tolerance
- Ability to run in a heterogeneous environment
- Ability to operate on encrypted data
- Ability to interface with business intelligence products

Data Analysis in the Cloud: Available options among open source and commercial databases: MapReduce-like software
- Fault tolerance:
  - Designed with fault tolerance as a high priority
  - A data analysis job is divided into many small tasks
  - Tasks assigned to a failed machine are transparently reassigned to another machine
- MapReduce experiment:
  - Killing 200 out of 1746 worker processes involved in a MapReduce job resulted in only a 5% degradation in query performance
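The reassignment idea can be shown with a toy scheduler. This is a simplified sketch under stated assumptions, not how any particular MapReduce implementation schedules work: tasks are first assigned round-robin, and any task that lands on a worker known to have failed is moved to a live worker. All names are hypothetical.

```python
def assign_tasks(tasks, workers, failed):
    """Assign each task to a worker, transparently moving tasks away from
    workers in the `failed` set."""
    live = [w for w in workers if w not in failed]
    if not live:
        raise RuntimeError("no live workers left to run the job")
    assignment = {}
    for i, task in enumerate(tasks):
        worker = workers[i % len(workers)]  # initial round-robin placement
        if worker in failed:
            worker = live[i % len(live)]    # reassign to a live worker
        assignment[task] = worker
    return assignment


tasks = [f"task-{i}" for i in range(8)]
workers = ["w0", "w1", "w2", "w3"]
print(assign_tasks(tasks, workers, failed={"w1"}))
```

Because the job is split into many small tasks, losing one worker only forces its share of the tasks to be rerun, not the whole job.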

Data Analysis in the Cloud: Available options among open source and commercial databases: MapReduce-like software
- Ability to run in a heterogeneous environment:
  - Towards the end of a MapReduce job, tasks that are still in progress get redundantly executed on other machines
  - A task is marked as completed as soon as either the primary or the backup execution has completed
  - This limits the effect that straggler machines can have on total query time, because backup executions of the tasks assigned to these machines will complete first
- MapReduce experiment:
  - Backup task execution was shown to improve query performance by 44% by alleviating the effects of slower machines
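A small model shows why backup tasks shorten the tail. In this sketch (an illustrative simplification with made-up timings), a job finishes when its slowest task finishes; any task still running at `backup_launch` gets a backup copy that takes `backup_duration` seconds, and the task counts as done when either copy completes.

```python
def job_time(primary_times, backup_launch=None, backup_duration=None):
    """Return the job completion time in seconds.

    primary_times: finish time of each task's primary execution.
    backup_launch: time at which backups are started for unfinished tasks
                   (None disables backup tasks).
    backup_duration: how long a backup execution takes once launched.
    """
    finish_times = []
    for primary in primary_times:
        finish = primary
        if backup_launch is not None and primary > backup_launch:
            # Whichever execution completes first marks the task as done.
            finish = min(primary, backup_launch + backup_duration)
        finish_times.append(finish)
    return max(finish_times)


timings = [100, 105, 98, 400]  # hypothetical: the last task is on a straggler
print("without backups:", job_time(timings))
print("with backups:   ", job_time(timings, backup_launch=110, backup_duration=100))
```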

Data Analysis in the Cloud: Available options among open source and commercial databases: MapReduce-like software
Figure (b) shows an execution of the sort program with backup tasks disabled.
Results:
- The execution flow is similar to Figure (a), except that there is a very long tail where hardly any write activity occurs
- After 960 seconds, all except 5 of the reduce tasks are completed
- The last few stragglers don't finish until 300 seconds later
- The entire computation takes 1283 seconds, an increase of 44% in elapsed time
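The figures quoted for the run without backup tasks can be cross-checked with a little arithmetic. The baseline time with backup tasks enabled is not stated on the slide, so the sketch below only derives what it would have to be for a 44% increase to yield 1283 seconds.

```python
TOTAL_WITHOUT_BACKUPS_S = 1283  # from the slide
INCREASE = 0.44                 # from the slide
STRAGGLER_TAIL_S = 300          # from the slide

# Baseline implied by the slide's numbers (derived here, not quoted).
implied_baseline_s = TOTAL_WITHOUT_BACKUPS_S / (1 + INCREASE)

# Share of the run spent waiting on the last five reduce tasks.
tail_share = STRAGGLER_TAIL_S / TOTAL_WITHOUT_BACKUPS_S

print(f"implied baseline with backup tasks: {implied_baseline_s:.0f} s")
print(f"share of elapsed time spent on stragglers: {tail_share:.0%}")
```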

Data Analysis in the Cloud: Available options among open source and commercial databases: MapReduce-like software
- Ability to operate on encrypted data:
  - No built-in ability to operate on encrypted data
  - Needs to be provided by user-defined code
- Ability to interface with business intelligence products:
  - Does not easily interface with existing business intelligence products
  - MapReduce was not intended to be a database system
  - Not SQL compliant

Data Analysis in the Cloud: Available options among open source and commercial databases: MapReduce-like software
The Grep program:
- Scans through 100-byte records searching for rare three-character patterns
- The patterns occur in 92,337 records
- The input is split into 64 MB pieces (M = 15000)
- The entire output is placed in one file (R = 1)
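The structure of such a grep job can be sketched in a few lines: one map task per input split emits matching records, and a single reduce task (R = 1) collects them into one output. This is a single-process illustration with hypothetical function names, not the benchmark program itself. The split count is also checked, assuming the 1 TB input used for this experiment and reading 64 MB as 64 x 2^20 bytes.

```python
import re


def num_splits(total_bytes: int, split_bytes: int) -> int:
    """Number of map tasks needed to cover the input (ceiling division)."""
    return -(-total_bytes // split_bytes)


def map_grep(record: str, pattern: re.Pattern):
    """Map function: emit the record if it contains the pattern."""
    if pattern.search(record):
        yield record


def grep_job(splits, pattern_text: str):
    """Run one map task per split, then one identity reduce (R = 1)."""
    pattern = re.compile(pattern_text)
    intermediate = []
    for split in splits:          # each split would be a separate map task
        for record in split:
            intermediate.extend(map_grep(record, pattern))
    return list(intermediate)     # the single reducer copies matches to the output


print(num_splits(10**12, 64 * 2**20))  # close to the M = 15000 quoted above
print(grep_job([["abcxyz", "hello"], ["zzxyz"]], "xyz"))
```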

Data Analysis in the Cloud: Available options among open source and commercial databases: MapReduce-like software
- Efficiency
- MapReduce experiment:
  - A rare string is searched for inside a 1 TB dataset
  - The 1 TB of data is read in parallel off 3600 disks in the cluster while the pattern search is performed
  - The entire Grep query took 150 seconds to complete
  - Dividing 1 TB of data by 3600 disks and by the 150 seconds of query time gives an average throughput of less than 2 MB/s/disk
  - At peak performance MapReduce was reading data at 32 GB/s, which is less than 10 MB/s/disk
- Conclusion: there is a need for improvement, given the long start-up time needed to reach peak performance, and given that peak performance is four to six times slower than the rate at which the disks in the cluster could actually be read
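The per-disk throughput figures follow directly from the numbers on the slide. The short calculation below reproduces them, taking 1 TB as 10^12 bytes and 1 MB as 10^6 bytes (an assumption about units); the last line derives the raw disk speed implied by the "four to six times slower" remark and is not a figure quoted in the source.

```python
TOTAL_BYTES = 10**12              # 1 TB dataset
DISKS = 3600
QUERY_SECONDS = 150
PEAK_BYTES_PER_SECOND = 32 * 10**9  # 32 GB/s aggregate at peak


def per_disk_mb_per_s(aggregate_bytes_per_second: float, disks: int) -> float:
    """Convert an aggregate read rate into MB/s per disk."""
    return aggregate_bytes_per_second / disks / 10**6


average = per_disk_mb_per_s(TOTAL_BYTES / QUERY_SECONDS, DISKS)
peak = per_disk_mb_per_s(PEAK_BYTES_PER_SECOND, DISKS)

print(f"average: {average:.2f} MB/s/disk")
print(f"peak:    {peak:.2f} MB/s/disk")
print(f"implied raw disk speed: {4 * peak:.0f} to {6 * peak:.0f} MB/s")
```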

Data Analysis in the Cloud: Available options among open source and commercial databases: Shared-nothing parallel databases
- Ability to interface with business intelligence products:
  - Business intelligence products are designed to work on top of databases, so this property comes essentially for free
- Efficiency: query performance is improved by implementing:
  - Indexes
  - Materialized views
  - Compression
- Fault tolerance:
  - Most parallel database systems restart a query upon a failure
  - Designed for environments where queries take no more than a few hours and run on no more than a few hundred machines
  - In such environments failures are rare, so an occasional query restart is not problematic
- Cloud environments differ:
  - Machines are cheaper
  - Less reliable
  - Less powerful
  - Failures are more common

Data Analysis in the Cloud: Available options among open source and commercial databases: Shared-nothing parallel databases
- Ability to run in a heterogeneous environment:
  - Designed to run on homogeneous equipment
  - Susceptible to degraded performance if a small subset of nodes in the parallel cluster is performing poorly
- Ability to operate on encrypted data:
  - Not implemented
  - Encryption support needs to be hand-coded using user-defined functions

Data Analysis in the Cloud: A Call For A Hybrid Solution:
- Neither MapReduce-like software nor parallel databases are ideal solutions for data analysis in the cloud
- Each provides certain features that meet some of the five recommended properties
- A hybrid solution could have an impact on the cloud market by combining:
  - From MapReduce: fault tolerance, the ability to run on a heterogeneous cluster, and ease of use
  - From parallel database systems: efficiency, performance, and tool pluggability

Data Analysis in the Cloud: A Call For A Hybrid Solution:
Current solutions:
- Some aim to integrate declarative queries from the database community into MapReduce-like software, which provides greater data independence, code reusability, and automatic query optimization:
  - Pig project (Yahoo)
  - SCOPE project (Microsoft)
- Greenplum and Aster Data added the ability to write MapReduce functions over data stored in their parallel database products

Data Analysis in the Cloud: A Call For A Hybrid Solution:
There remains a need for a hybrid solution at the systems level in addition to the language level.
- Future research: how to combine the advantages of MapReduce-like software with the efficiency and shared-work advantages that come with loading data and creating performance-enhancing data structures?
- Proposal: incremental algorithms
  - Allow data to be initially read directly off the file system
  - Each time data is accessed, progress is made towards the many activities surrounding a DBMS load:
    - Compression
    - Indexing
    - Materialized view creation
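One way to picture the incremental proposal is an index that gets built as a side effect of queries. In the hypothetical sketch below, the first lookup scans the raw rows directly, as if reading them off the file system, and each lookup indexes one more chunk, so later lookups scan less and less. The class name, chunk size, and row layout are illustrative assumptions, not a design taken from the slides.

```python
class IncrementalIndex:
    """Key lookup over raw rows that builds its index a chunk at a time."""

    def __init__(self, rows, chunk_size=2):
        self.rows = rows          # raw (key, value) rows, never reloaded
        self.chunk_size = chunk_size
        self.index = {}           # key -> positions of rows already indexed
        self.indexed_upto = 0     # rows[:indexed_upto] are covered by the index

    def lookup(self, key):
        # Use the index for the prefix that has already been loaded.
        hits = [self.rows[i] for i in self.index.get(key, [])]
        # Fall back to a plain scan for the part not yet indexed.
        for i in range(self.indexed_upto, len(self.rows)):
            if self.rows[i][0] == key:
                hits.append(self.rows[i])
        # Make progress on the load: index one more chunk per access.
        end = min(self.indexed_upto + self.chunk_size, len(self.rows))
        for i in range(self.indexed_upto, end):
            self.index.setdefault(self.rows[i][0], []).append(i)
        self.indexed_upto = end
        return hits


table = IncrementalIndex([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
for _ in range(3):
    print(table.lookup("a"), "indexed rows:", table.indexed_upto)
```

The answers stay the same on every call; only the amount of scanning changes as the index fills in.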

Data Analysis in the Cloud: A Call For A Hybrid Solution:
There remains a need for a hybrid solution at the systems level in addition to the language level.
- Future research: how to balance the tradeoffs between fault tolerance and performance?
  - Fault tolerance can be maximized by carefully checkpointing intermediate results, at the cost of performance
  - A system could adjust its level of fault tolerance on the fly, given an observed failure rate
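As a rough illustration of adjusting fault tolerance to an observed failure rate, the sketch below picks a checkpoint interval from the measured mean time between failures. It uses Young's approximation, interval = sqrt(2 x checkpoint cost x MTBF), which is brought in here as an assumption and is not something the slides propose; the function name and inputs are hypothetical.

```python
import math


def checkpoint_interval(checkpoint_cost_s: float,
                        observed_failures: int,
                        observed_seconds: float) -> float:
    """Suggest how many seconds to run between checkpoints.

    With no observed failures the function returns infinity, meaning
    checkpointing can be skipped in favor of raw performance.
    """
    if observed_failures == 0:
        return math.inf
    mtbf = observed_seconds / observed_failures  # mean time between failures
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf)


# Hypothetical observations over one hour, with a 10-second checkpoint cost.
for failures in (0, 2, 8):
    interval = checkpoint_interval(10.0, failures, 3600.0)
    print(f"{failures} failures/hour -> checkpoint every {interval:.0f} s")
```

A higher observed failure rate shortens the interval, trading some performance for less lost work per failure.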

Summary:
- The advantages and disadvantages of deploying database systems in the cloud were discussed
- Read-mostly analytical data management applications are better suited for cloud deployment than transactional data management applications
- Currently available systems are not ideally suited for large-scale data analysis in the cloud, so there is a need for a new DBMS design to handle this type of analysis
- There is a need for hybrid solutions at the systems level that combine the features of MapReduce and parallel database systems

Questions?