Data Sharing in the Cloud: Scaling to the World, Unleashing Creativity, and Generating Value?




Data Sharing in the Cloud: Scaling to the World, Unleashing Creativity, and Generating Value?
Marcos Vaz Salles, Assistant Professor, University of Copenhagen (DIKU)

About the Speaker
Marcos Vaz Salles, Assistant Professor, University of Copenhagen (DIKU)
Postdoc: Cornell University
PhD: ETH Zurich
Mission: Find creative ways to expand the reach of the 30+ years of top-level R&D invested in database technology, broadly defined
Examples: Database techniques for search and integration, games, simulations, geospatial data

Where does your most important data live?
DATABASES!

Historical Justification for Databases
Common applications: record maintenance, banking, government
Complex implementation: concurrency, integrity, durability, storage, representation, ...
Enough abstraction: operating systems virtualize low-level hardware
Competing platforms: no virtualization of platform (IBM, DEC, ...)
Layer stack, top to bottom:
  Data-Driven Applications
  Data Sharing (DBMS)
  Virtualization (Operating Systems)
  Platforms (Hardware)
But the Cloud today is completely different?!

The Cloud Today
Common applications: Web Services, Data Warehousing, Big Data
Complex implementation: data consistency and management, distribution, scalability, fault tolerance
Enough abstraction: Cloud IaaS virtualizes enormous clusters of machines
Competing platforms: no virtualization of platform (Amazon, Microsoft, ...)
Layer stack, top to bottom:
  Data-Driven Applications
  Data Sharing (????)
  Virtualization (Cloud IaaS)
  Platforms (Cloud Datacenter)
Challenge: What should be the new Data Sharing Abstraction in the Cloud?

From Databases to Dataclouds
While there were databases in the past, we will have dataclouds in the future
  Databases → Database Management System (DBMS)
  Dataclouds → Datacloud Management System (DCMS)
Emerging application systems are already being built!
  But at high cost
  And with fewer features than desired

Emerging Datacloud Application Systems
Programmable news services
  Example: Guardian.co.uk Open Platform & MicroApps
Programmable social networks
  Example: Apps on Facebook
Programmable CRM
  Example: Salesforce Platform
Far-fetched (?!) future
  Programmable government
  Programmable banking
  Programmable whoever-has-data
Data is a new means of production

Challenges in Dataclouds and DCMS
Programming, programming, programming
  Re-use or create new programming abstractions?
  How to incorporate data into software engineering?
Resources, resources, resources
  How to deal with virtualized environments and abstract cost?
Scale, scale, scale
  How to scale applications to petabytes automatically?
Career Opportunity: DataCloud Administrator (DCA) :)

ClouDiA: A Cloud Deployment Advisor
Initial work on deployment of latency-sensitive data services in public clouds
  Simulation analytics, e.g., multi-agent simulations
  Search engines
  Key-value stores
Acknowledgment: Joint work with Tao Zou, Ronan LeBras, Alan Demers, and Johannes Gehrke at Cornell University, to appear at VLDB 2013

Latency-sensitive Data Services
Distributed, latency-sensitive applications
  Goal: Time-to-solution
  Goal: Service response time
Communication graph: captures interaction among application nodes
[Figure: example communication graphs: grid, tree, bipartite]

Running Example: Fish Schools

Latency in the Cloud
Some links have far worse latency than others
Mean link latency is fairly stable over time
[Figure: mean latency measurement in Amazon EC2; 100 large instances, 100^2 links, measured every hour for 10 days; TCP round-trip times of 1KB messages]
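The measurement described on this slide can be summarized per link before any deployment decision is made. The sketch below is only an illustration: the function names `mean_link_latency` and `worst_links`, the instance labels, and the sample values are all made up here, not taken from ClouDiA. It averages repeated round-trip samples per directed link and lists the links with the highest mean.

```python
from collections import defaultdict
from statistics import mean


def mean_link_latency(samples):
    """Average round-trip samples per directed link (src, dst)."""
    by_link = defaultdict(list)
    for src, dst, rtt_us in samples:
        by_link[(src, dst)].append(rtt_us)
    return {link: mean(rtts) for link, rtts in by_link.items()}


def worst_links(mean_latency, k):
    """Return the k links with the highest mean latency, worst first."""
    return sorted(mean_latency, key=mean_latency.get, reverse=True)[:k]


# Hypothetical round-trip times in microseconds, two samples per link.
samples = [
    ("a", "b", 400), ("a", "b", 600),
    ("a", "c", 1900), ("a", "c", 2100),
    ("b", "c", 500), ("b", "c", 500),
]
latency = mean_link_latency(samples)
slowest = worst_links(latency, 1)
```

Because the slide reports that mean link latency is fairly stable over time, a per-link mean of this kind is a plausible input to a deployment decision.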

Key Observations
Observation #1: Avoid bad links
  A typical communication graph requires fewer links than a complete graph
  Deploy application nodes to instances carefully
Observation #2: Over-allocate to get better links
  Say the communication graph has n nodes
  Allocate, e.g., 1.1n instances
  Deploy and terminate the extra 0.1n instances
Why do we care?
  A) Improve response time
  B) Spend less money
  C) Get more bang for the buck
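Observation #2 can be sketched in a few lines. The code below is a hypothetical illustration, not ClouDiA's method: `instances_to_allocate` computes the over-allocated instance count from an assumed integer percentage, and `prune_extra` uses a naive heuristic (drop the instance with the highest mean latency to the others) to decide which extras to terminate. ClouDiA itself chooses the deployment by optimization.

```python
def instances_to_allocate(n_nodes, extra_percent=10):
    """Number of instances to request: n plus extra_percent of n, rounded up."""
    extra = -(-n_nodes * extra_percent // 100)  # integer ceiling division
    return n_nodes + extra


def prune_extra(instances, latency, n_keep):
    """Naive heuristic (an assumption, not ClouDiA's optimizer).

    Repeatedly drops the instance whose mean latency to the remaining
    instances is highest until n_keep instances are left (n_keep >= 1).
    latency[i][j] is the measured cost of the link between i and j.
    """
    kept = list(instances)
    while len(kept) > n_keep:
        def avg_latency(i):
            others = [j for j in kept if j != i]
            return sum(latency[i][j] for j in others) / len(others)
        kept.remove(max(kept, key=avg_latency))
    return kept


# Made-up symmetric latency matrix: instance 3 has poor links to everyone.
latency = [
    [0, 1, 1, 5],
    [1, 0, 1, 5],
    [1, 1, 0, 5],
    [5, 5, 5, 0],
]
kept = prune_extra(range(4), latency, n_keep=3)
```

With n = 100 nodes and the slide's 1.1n example, `instances_to_allocate(100)` requests 110 instances, of which 10 are terminated after deployment.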

Node Deployment by Example
Simulation analytics
  Tick-based, with synchronization at the end of every tick, in a grid
Objective: Minimize worst link
[Figure: a 3x3 grid communication graph (nodes 1-9) deployed onto instances 1-9, with link costs of 1, 2, or 3. In the first deployment the objective function value is 3; in the second, with nodes 6 and 8 exchanged, it is 2. Source: LeBras, Zou (partial)]
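The worst-link objective from this example can be evaluated directly for any candidate deployment. The following is a minimal sketch with made-up link costs (they are not the costs shown on the slide), 0-based node numbering, and hypothetical helper names `grid_edges` and `worst_link_cost`. A deployment is a list where `deployment[node]` is the instance that hosts that node.

```python
def grid_edges(rows, cols):
    """Edges of a rows x cols grid communication graph, nodes numbered row by row."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:
                edges.append((node, node + 1))      # right neighbor
            if r + 1 < rows:
                edges.append((node, node + cols))   # neighbor below
    return edges


def worst_link_cost(edges, deployment, cost):
    """Highest link cost over all communicating node pairs."""
    return max(cost[deployment[u]][deployment[v]] for u, v in edges)


# Hypothetical symmetric costs between 9 instances: 1 by default.
n = 9
cost = [[0 if i == j else 1 for j in range(n)] for i in range(n)]
cost[5][8] = cost[8][5] = 3   # one bad link
cost[5][6] = cost[6][5] = 2   # one mediocre link

edges = grid_edges(3, 3)
identity = list(range(n))               # node k runs on instance k
swapped = [0, 1, 2, 3, 4, 5, 8, 7, 6]   # nodes 6 and 8 trade instances

before = worst_link_cost(edges, identity, cost)
after = worst_link_cost(edges, swapped, cost)
```

In this toy setup the identity deployment places two communicating nodes on the cost-3 link, while the swapped deployment leaves that link unused, so the objective drops from 3 to 2.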

Summary of Node Deployment
Objectives
  Minimize cost of worst link
  Minimize cost of longest path
Optimization Methods
  Akin to graph embedding problem, but with minimization goals
  Mixed-integer programming (MIP) formulation for both objectives
  Constraint programming (CP) formulation also for worst link
  Greedy is easy to beat
Network measurements
  Staged message exchange to measure costs
More details in the paper!
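The second objective, cost of the longest path, can be evaluated in the same style as the worst-link objective. The sketch below assumes a tree-shaped communication graph and takes the path cost to be the sum of link costs from the root to a leaf. The function name `longest_path_cost`, the tree, and the costs are all hypothetical, and this only scores a given deployment; it does not implement the MIP or CP formulations mentioned above.

```python
def longest_path_cost(children, root, deployment, cost):
    """Largest summed link cost along any root-to-leaf path of a tree."""
    def down(node):
        return max(
            (cost[deployment[node]][deployment[child]] + down(child)
             for child in children.get(node, [])),
            default=0,  # a leaf contributes no further cost
        )
    return down(root)


# Hypothetical two-level aggregation tree with 7 nodes, root 0.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}

# Made-up symmetric costs between 7 instances: 1 by default.
n = 7
cost = [[0 if i == j else 1 for j in range(n)] for i in range(n)]
cost[0][1] = cost[1][0] = 3
cost[1][3] = cost[3][1] = 2

identity = list(range(n))            # node k runs on instance k
swapped = [0, 6, 2, 3, 4, 5, 1]      # nodes 1 and 6 trade instances

before = longest_path_cost(children, 0, identity, cost)
after = longest_path_cost(children, 0, swapped, cost)
```

Here the identity deployment has a path of cost 3 + 2 = 5 through the two expensive links, while the swapped deployment avoids both and every root-to-leaf path costs 2.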

Experiments with ClouDiA on Amazon EC2
Workloads & Setup
Behavioral simulation
  Fish simulation by Couzin et al., Nature
  2D mesh
  100 Amazon EC2 large instances
  Minimize Worst Link objective
Synthetic aggregation workload
  Models search engines, distributed text databases
  Multi-level aggregation tree
  50 Amazon EC2 large instances
  Minimize Longest Path objective
Key-value store workload
  Bipartite graph of front-end servers and storage servers
  100 Amazon EC2 large instances
  Minimize Worst Link objective used, but not a perfect fit

Overall Improvement: All Workloads
15%-55% reduction in time
The aggregation query shows the largest improvement

Effect of Over-Allocation: Behavioral Simulation
Default always uses the first 100 instances
Improvements with ClouDiA: 16% without over-allocation, 38% with 50% extra instances

Wrap-up
Dataclouds and DCMS
  Programming, programming, programming
  Resources, resources, resources
  Scale, scale, scale
ClouDiA
  An initial step in resource optimization in public clouds
Next steps: Collaborate with us to build a DCMS!
  Tons of research challenges open
  We are already collaborating with the Danish Geodata Agency (GST)
  We are looking for partners :)
Thank you!