How To Win At A Game Of Monopoly On The Moon



Similar documents
Getting Started with SandStorm NoSQL Benchmark

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

COURSE CONTENT Big Data and Hadoop Training

Implement Hadoop jobs to extract business value from large and varied data sets

Automating Big Data Benchmarking for Different Architectures with ALOJA

Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering

Performance Analysis of Cloud Relational Database Services

A Generic Auto-Provisioning Framework for Cloud Databases

Big data blue print for cloud architecture

Duke University

Can the Elephants Handle the NoSQL Onslaught?

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

CloudRank-D:A Benchmark Suite for Private Cloud Systems

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Hadoop and MySQL for Big Data

Preview of Oracle Database 12c In-Memory Option. Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

A Brief Introduction to Apache Tez

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Alexander Rubin Principle Architect, Percona April 18, Using Hadoop Together with MySQL for Data Analysis

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Performance and Scalability Overview

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering

BIG DATA TRENDS AND TECHNOLOGIES

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Maximum performance, minimal risk for data warehousing

Best Practices for Hadoop Data Analysis with Tableau

Designing a Data Solution with Microsoft SQL Server 2014

Deliverable Billion Triple dataset hosted on the LOD2 Knowledge Store Cluster. LOD2 Creating Knowledge out of Interlinked Data

Oracle Database 12c: Performance Management and Tuning NEW

SQL Server What s New? Christopher Speer. Technology Solution Specialist (SQL Server, BizTalk Server, Power BI, Azure) v-cspeer@microsoft.

Towards an Optimized Big Data Processing System

Parquet. Columnar storage for the people

Oracle Big Data SQL Technical Update

Herodotos Herodotou Shivnath Babu. Duke University

The Cubetree Storage Organization

Characterizing Task Usage Shapes in Google s Compute Clusters

Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010

Hadoop & Spark Using Amazon EMR

Building your Big Data Architecture on Amazon Web Services

Performance and Scalability Overview

Unified Big Data Processing with Apache Spark. Matei

KNIME & Avira, or how I ve learned to love Big Data

SQL Server Instance-Level Benchmarks with HammerDB

Azure Data Lake Analytics

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

TRAINING PROGRAM ON BIGDATA/HADOOP

Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations. Database Solutions Engineering

Savanna Hadoop on. OpenStack. Savanna Technical Lead

Introduction to Decision Support, Data Warehousing, Business Intelligence, and Analytical Load Testing for all Databases

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Native Connectivity to Big Data Sources in MSTR 10

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

How To Handle Big Data With A Data Scientist

Adobe Deploys Hadoop as a Service on VMware vsphere

Enterprise GIS Architecture Deployment Options. Andrew Sakowicz

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

Real Time Big Data Processing

Rackspace Cloud Databases and Container-based Virtualization

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center

HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA SQL ANALYTICS PROCESSING IN THE CLOUD A REAL WORLD CASE STUDY

Benchmarking Cassandra on Violin

XpoLog Competitive Comparison Sheet

CSE-E5430 Scalable Cloud Computing Lecture 2

The Inside Scoop on Hadoop

Virtualizing Apache Hadoop. June, 2012

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Apache Hadoop Ecosystem

Real-time Ad-hoc Analytics on S3 with MemSQL

Data Management in the Cloud

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Management & Analysis of Big Data in Zenith Team

CRITEO INTERNSHIP PROGRAM 2015/2016

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Cloudera Manager Training: Hands-On Exercises

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Next-Generation Cloud Analytics with Amazon Redshift

Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler

Transcription:

Changing the Face of Database Cloud Services with Personalized Service Level Agreements Jennifer Ortiz, Victor Teixeira de Almeida, Magdalena Balazinska University of Washington, Computer Science and Engineering PETROBRAS S.A., Rio de Janerio, RJ, Brazil CIDR 2015 1

2

Many Data Management & Analytics Systems Available 3

Many Systems are Available as Cloud Services 4

Which Hadoop Version? Cloud Services Today Amazon EMR Pig or Hive? How many instances of the service? 5

Which Hadoop Version? Cloud Services Today Amazon EMR Pig or Hive? How many instances of the service? 6

Cloud Services Today BigQuery How long will my query take? 7

Cloud Services Can Do Better! System Internals 8

Cloud Services Can Do Better! System Internals 9

Cloud Services Can Do Better! Query: Query Capabilities Time Money SELECT FROM WHERE 10

A new proposal Time to Re-think the interface Hide details of cluster deployment and resources Show users monetary costs and performance estimates on their data Let users pick the desired trade-off between options shown Personalized Service Level Agreements 11

Tier 1: $0.10/hour Within 20 seconds: SELECT <up to 10 attributes> FROM <Fact Dimension> WHERE <up to 100% of data> Within 1 minute: SELECT <up to 5 attributes> FROM <JOIN Fact + 4 Dimensions> WHERE <up to 10% of data> A PSLA Example Fixed, hourly price Expected performance Templates capture capabilities Within 10 minutes: SELECT <up to 10 attributes> FROM <JOIN Fact + 8 Dimensions> WHERE <up to 100% of data> 12

Tier 1: $0.10/hour Within 20 seconds: SELECT <up to 10 attributes> FROM <Fact Dimension> WHERE <up to 100% of data> Within 1 minute: SELECT <up to 5 attributes> FROM <JOIN Fact + 4 Dimensions> WHERE <up to 10% of data> A PSLA Example Fixed, hourly price Expected performance Templates capture capabilities Tier 2: $0.50/hour Within 1 second: SELECT <up to 10 attributes> FROM <Fact Dimension> WHERE <up to 100% of data> Different tiers of service Within 10 minutes: SELECT <up to 10 attributes> FROM <JOIN Fact + 8 Dimensions> WHERE <up to 100% of data> 13

Goals Cloud C Database D 14

Goals Cloud C PSLA P Database D Money, Time, Capabilities 15

Goals Cloud C PSLAManager PSLA P Database D Money, Time, Capabilities 16

Goals Cloud C PSLAManager PSLA P Database D Money, Time, Capabilities 17

Example of a Real PSLA 18

TPC-H Star Schema Benchmark Based on TPC-H 10GB 19

Myria is a data management service in the cloud that we built at UW. It has a parallel, shared-nothing back-end query execution engine called MyriaX 20

PSLA for Myria 21

PSLA for Myria 22

PSLA for Myria 23

PSLA for Myria 24

PSLA for Myria 25

PSLAManager Cloud C PSLAManager PSLA P Database D 26

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 27

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 28

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 29

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 30

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 31

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 32

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 33

Query Workload Generation Which queries to generate? Joins drive performance Think about possible combinations of joins Consider: All possible 2-way joins Tables in Order by Size: Lineitem, Part, Customer, Supplier, Date (Lineitem Part) (Customer Date), (Lineitem Supplier), etc. Only consider most expensive queries Build toward more complex queries, include selections and projections 34

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 35

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 36

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 37

Seconds Seconds Tier Selection Runtime Distributions of Query Workload Per Configuration in Myria 350 300 250 200 EMD 17.43 150 100 50 EMD 7.07 EMD 13.74 0 0 2 4 6 8 10 12 14 16 18 Workers Workers (Configurations) 38

Tier Selection 39

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 40

Workload Compression STEP 1: Query Clustering Threshold-based Density-based 0 2 4 6 8 10 12 14 16 18 Workers 41

Workload Compression STEP 1: Query Clustering th 0 2 4 6 8 10 12 14 16 18 Workers 42

Workload Compression STEP 1: Query Clustering 0 2 4 6 8 10 12 14 16 18 Workers 43

Workload Compression STEP 1: Query Clustering Tier 1: $0.XX/hour Within Cluster Max Threshold: Query Templates in Cluster Within Cluster Max Threshold: Query Templates in Cluster 44

Time(s) Workload Compression STEP 2: Template Generation Queries Query Query Templates Dominance SELECT (5 ATT) FROM (5 TABLES) WHERE 10% SELECT (4 ATT) FROM (4 TABLES) WHERE 1% Configuration 45

Time(s) Workload Compression STEP 2: Template Generation Attributes Projected Query Dominance Tables SELECT (5 ATT) FROM (5 TABLES) WHERE 10% SELECT (4 ATT) FROM (4 TABLES) WHERE 1% Selectivity Configuration 46

Time(s) Workload Compression STEP 2: Template Generation Attributes Projected Query Dominance Tables SELECT (5 ATT) FROM (5 TABLES) WHERE 10% SELECT (4 ATT) FROM (4 TABLES) WHERE 1% Selectivity Given: Configuration 47

Time(s) Workload Compression STEP 2: Template Generation Configuration 48

Time(s) Workload Compression STEP 2: Template Generation Root Query Template: We call a query template a root query template if no other query template in the same cluster dominates it. Configuration 49

Time(s) Workload Compression STEP 3: Dropping Queries with Similar Times Configuration 1 50

Time(s) Workload Compression STEP 3: Dropping Queries with Similar Times Root Query Templates Queries Query Templates Configuration 1 Configuration 2 51

Time(s) Workload Compression STEP 3: Dropping Queries with Similar Times Configuration 1 Configuration 2 52

Time(s) Workload Compression STEP 3: Dropping Queries with Similar Times Configuration 1 Configuration 2 53

Time(s) Workload Compression STEP 3: Dropping Queries with Similar Times Configuration 1 Configuration 2 54

PSLA Quality Assessment 55

PSLA Quality Metrics PSLA Query Capabilities PSLA Complexity PSLA Performance Error Metric 56

Time(s) Time(s) Quality Metric Trade-offs High Complexity Low Error Low Complexity High Error Found that a log interval is best without tuning 57

PSLA Evaluation on Predicted Runtimes 58

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 59

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 60

Performance Model (Offline) Train model offline on other data and queries Cloud Configuration Query Feature Vector Est. Rows Est. IO Avg. Row runtime Query workload Predict runtime from query features Query Feature Vector Cloud Configuration Est. Rows Est. IO Avg. Row runtime Based on Predicting Multiple Metrics for Queries: Better Decisions enabled by Machine Learning [Ganapathi et. al. 2009] 61

Training Dataset Synthetic Dataset 10GB 6 Tables 61 Attributes 62

PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 63

Predicted Myria PSLA (Predicted Runtimes) 64

Looking Forward Direct extensions to the approach Add support for indexes Improve time predictions Longer-term future work Can we guarantee the runtimes? Can we update the PSLA as user queries data Goal is to show increasingly more complex queries Usability testing 65

Conclusion Many cloud DBMSs exist Require users to reason about resources We propose to re-think that interface Personalized Service Level Agreements Service Tiers Price/Capabilities/Performance Important direction for cloud DBMSs 66