Data Sharing in the Cloud: Scaling to the World, Unleashing Creativity, and Generating Value? Marcos Vaz Salles Assistant Professor, University of Copenhagen (DIKU)
About the Speaker: Marcos Vaz Salles, Assistant Professor, University of Copenhagen (DIKU). Postdoc: Cornell University. PhD: ETH Zurich. Mission: Find creative ways to expand the reach of the 30+ years of top-level R&D invested in database technology, broadly defined. Examples: Database techniques for search and integration, games, simulations, geospatial data.
Where does your most important data live? DATABASES!
Historical Justification for Databases. Common applications: record maintenance, banking, government. Complex implementation: concurrency, integrity, durability, storage, representation, ... Enough abstraction: operating systems virtualize low-level hardware. Competing platforms: no virtualization of platform (IBM, DEC, ...). Stack, top to bottom: Data-Driven Applications; Data Sharing (DBMS); Virtualization (Operating Systems); Platforms (Hardware). But the Cloud today is completely different?!
The Cloud Today. Common applications: Web services, data warehousing, Big Data. Complex implementation: data consistency and management, distribution, scalability, fault tolerance. Enough abstraction: Cloud IaaS virtualizes enormous clusters of machines. Competing platforms: no virtualization of platform (Amazon, Microsoft, ...). Stack, top to bottom: Data-Driven Applications; Data Sharing (????); Virtualization (Cloud IaaS); Platforms (Cloud Datacenter). Challenge: What should be the new Data Sharing Abstraction in the Cloud?
From Databases to Dataclouds. While there were databases in the past, we will have dataclouds in the future. Databases → Database Management System (DBMS). Dataclouds → Datacloud Management System (DCMS). Emerging application systems are already being built! But at high cost, and with fewer features than desired.
Emerging Datacloud Application Systems. Programmable news services (example: Guardian.co.uk Open Platform & MicroApps). Programmable social networks (example: Apps on Facebook). Programmable CRM (example: Salesforce Platform). Far-fetched (?!) future: programmable government, programmable banking, programmable whoever-has-data. Data is a new means of production.
Challenges in Dataclouds and DCMS. Programming, programming, programming: Re-use or create new programming abstractions? How to incorporate data into software engineering? Resources, resources, resources: How to deal with virtualized environments and abstract cost? Scale, scale, scale: How to scale applications to petabytes automatically? Career Opportunity: DataCloud Administrator (DCA) :-)
ClouDiA: A Cloud Deployment Advisor. Initial work on deployment of latency-sensitive data services in public clouds: simulation analytics (e.g., multi-agent simulations), search engines, key-value stores. Acknowledgment: Joint work with Tao Zou, Ronan LeBras, Alan Demers, and Johannes Gehrke at Cornell University, to appear at VLDB 2013.
Latency-sensitive Data Services. Distributed, latency-sensitive applications. Goal: time-to-solution. Goal: service response time. Communication graph: captures interaction among application nodes. [Figure: example communication graphs: grid, tree, bipartite.]
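The three communication graph shapes named on this slide can be sketched as edge sets. This is a minimal illustration only: the function names, the node numbering, and the `fanout` parameter are assumptions made here, not part of the talk.

```python
from itertools import product

def grid_graph(rows, cols):
    """Edges of a rows x cols 2D mesh; nodes are numbered 0.. row by row."""
    edges = set()
    for r, c in product(range(rows), range(cols)):
        node = r * cols + c
        if c + 1 < cols:
            edges.add((node, node + 1))      # right neighbour
        if r + 1 < rows:
            edges.add((node, node + cols))   # neighbour below
    return edges

def tree_graph(n, fanout=2):
    """Edges (parent, child) of a complete fanout-ary tree on n nodes; node 0 is the root."""
    return {((child - 1) // fanout, child) for child in range(1, n)}

def bipartite_graph(front_ends, storage):
    """Every front-end node (0..front_ends-1) talks to every storage node."""
    return {(f, front_ends + s) for f in range(front_ends) for s in range(storage)}

print(len(grid_graph(3, 3)), len(tree_graph(7)), len(bipartite_graph(2, 3)))
```

A 3x3 grid built this way has 12 links, whereas the complete graph on 9 nodes has 36, so only a fraction of the instance-to-instance links actually carry traffic.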
Running Example: Fish Schools
Latency in the Cloud. Some links have far worse latency than others. Mean link latency is fairly stable over time. [Figure: mean latency measurements in Amazon EC2: 100 large instances, 100^2 links, measured every hour for 10 days; TCP round-trip times of 1 KB messages.]
Key Observations. Observation #1: Avoid bad links. A typical communication graph requires fewer links than the complete graph, so deploy application nodes to instances carefully. Observation #2: Over-allocate to get better links. Say the communication graph has n nodes: allocate, e.g., 1.1n instances, deploy, and terminate the extra 0.1n instances. Why do we care? A) Improve response time. B) Spend less money. C) Get more bang for the buck.
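Observation #2 can be illustrated on a toy instance. The sketch below uses exhaustive search, which is only feasible for tiny inputs and is not the optimization method of ClouDiA; the chain graph, the instance names, and all link costs are hypothetical.

```python
from itertools import permutations

def best_deployment(edges, n_nodes, cost, instances):
    """Try every one-to-one placement of nodes on instances.

    Returns (worst link cost, placement), where placement[i] is the
    instance hosting node i. Exhaustive, so only usable on toy inputs.
    """
    best = None
    for chosen in permutations(instances, n_nodes):
        worst = max(cost[frozenset((chosen[a], chosen[b]))] for a, b in edges)
        if best is None or worst < best[0]:
            best = (worst, chosen)
    return best

# Hypothetical 3-node chain 0-1-2.
edges = [(0, 1), (1, 2)]

# Hypothetical symmetric link costs among four instances a, b, c, d.
cost = {frozenset(pair): c for pair, c in {
    ("a", "b"): 3, ("a", "c"): 3, ("b", "c"): 2,
    ("a", "d"): 1, ("b", "d"): 1, ("c", "d"): 2,
}.items()}

exact = best_deployment(edges, 3, cost, ["a", "b", "c"])       # allocate exactly n
extra = best_deployment(edges, 3, cost, ["a", "b", "c", "d"])  # over-allocate by one
unused = set("abcd") - set(extra[1])   # instance to terminate after deployment

print(exact[0], extra[0], unused)
```

With exactly three instances every placement uses a link of cost 3; with one extra instance the search finds a placement whose worst link costs 1, and the instance left over is the one to terminate.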
Node Deployment by Example. Simulation analytics: tick-based, with synchronization at the end of every tick in a grid. Objective: minimize the worst link. [Figure: a 3x3 grid communication graph (nodes 1 to 9) deployed onto nine instances whose links cost 1, 2, or 3. A first deployment has objective function value 3; a second deployment, in which nodes 6 and 8 trade places, has objective function value 2.] Source: LeBras, Zou (partial).
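The worst-link objective in this example can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are assumptions, and the link costs are hypothetical values chosen for the demonstration, not the costs shown on the slide.

```python
def grid_edges(rows, cols):
    """Undirected edges of a rows x cols mesh; nodes are numbered 1.. row by row."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c + 1
            if c + 1 < cols:
                edges.append((node, node + 1))
            if r + 1 < rows:
                edges.append((node, node + cols))
    return edges

def worst_link(edges, deployment, cost, default=1):
    """Objective value: largest cost among instance links that carry a graph edge."""
    return max(
        cost.get(frozenset((deployment[a], deployment[b])), default)
        for a, b in edges
    )

edges = grid_edges(3, 3)

# Hypothetical instance-to-instance costs; every unlisted link costs 1.
cost = {frozenset((7, 8)): 3, frozenset((3, 6)): 2, frozenset((1, 2)): 2}

identity = {node: node for node in range(1, 10)}  # node i runs on instance i
swapped = dict(identity)
swapped[6], swapped[8] = 8, 6                     # nodes 6 and 8 exchange instances

print(worst_link(edges, identity, cost))
print(worst_link(edges, swapped, cost))
```

Under these assumed costs the first placement routes a grid edge over the expensive link between instances 7 and 8, while the swapped placement leaves that link unused, so its objective value is lower.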
Summary of Node Deployment. Objectives: minimize the cost of the worst link; minimize the cost of the longest path. Optimization methods: akin to the graph embedding problem, but with minimization goals; mixed-integer programming (MIP) formulation for both objectives; constraint programming (CP) formulation also for the worst link; a greedy approach is easy to beat. Network measurements: staged message exchange to measure costs. More details in the paper!
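One standard way to write the worst-link objective as a MIP is sketched below. It is a generic linearization under assumed notation, and the formulation in the paper may differ: $G = (V, E)$ is the communication graph, $S$ is the set of allocated instances with $|S| \ge |V|$, $C_{jj'} \ge 0$ is the measured cost of the link between instances $j$ and $j'$, $x_{ij}$ is 1 when node $i$ is deployed on instance $j$, and $c$ is the cost of the worst link in use.

```latex
\begin{aligned}
\min\ & c \\
\text{s.t.}\ & \sum_{j \in S} x_{ij} = 1 && \forall i \in V \\
& \sum_{i \in V} x_{ij} \le 1 && \forall j \in S \\
& c \ge C_{jj'}\,(x_{ij} + x_{i'j'} - 1) && \forall (i,i') \in E,\ \forall j \ne j' \in S \\
& x_{ij} \in \{0,1\},\quad c \ge 0
\end{aligned}
```

The third constraint only binds when both endpoints of a graph edge are placed on the pair $(j, j')$; otherwise its right-hand side is at most zero and it is implied by $c \ge 0$.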
Experiments with ClouDiA on Amazon EC2: Workloads & Setup. Behavioral simulation: fish simulation by Couzin et al., Nature; 2D mesh; 100 Amazon EC2 large instances; Minimize Worst Link objective. Synthetic aggregation workload: models search engines and distributed text databases; multi-level aggregation tree; 50 Amazon EC2 large instances; Minimize Longest Path objective. Key-value store workload: bipartite graph of front-end servers and storage servers; 100 Amazon EC2 large instances; Minimize Worst Link objective used, but not a perfect fit.
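For the aggregation tree, one plausible reading of the Minimize Longest Path objective is the largest sum of link costs along any root-to-leaf path. The sketch below evaluates that quantity for a given deployment; the tree, the function name, and the costs are hypothetical, and the exact definition in the paper may differ.

```python
def longest_path_cost(children, deployment, cost, root=0):
    """Largest sum of link costs along any root-to-leaf path of the tree.

    children maps a node to its child nodes; deployment maps a node to the
    instance hosting it; cost maps frozenset({inst1, inst2}) to a link cost.
    """
    def descend(node):
        return max(
            (cost[frozenset((deployment[node], deployment[kid]))] + descend(kid)
             for kid in children.get(node, [])),
            default=0,  # a leaf contributes no further cost
        )
    return descend(root)

# Hypothetical two-level aggregation tree: 0 is the root, 3 and 4 hang under 1.
children = {0: [1, 2], 1: [3, 4]}
deployment = {node: node for node in range(5)}  # node i runs on instance i

# Hypothetical link costs for the instance pairs this deployment uses.
cost = {
    frozenset((0, 1)): 2, frozenset((0, 2)): 4,
    frozenset((1, 3)): 1, frozenset((1, 4)): 3,
}

print(longest_path_cost(children, deployment, cost))
```

Here the path 0, 1, 4 costs 2 + 3 and dominates the single expensive link from 0 to 2, which shows why this objective can prefer a different deployment than the worst-link objective.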
Overall Improvement: All Workloads. 15%-55% reduction in time. The aggregation query shows the largest improvement.
Effect of Over-Allocation: Behavioral Simulation. The default always uses the first 100 instances. Improvements with ClouDiA: 16% without over-allocation, 38% with 50% extra instances.
Wrap-up. Dataclouds and DCMS: programming, programming, programming; resources, resources, resources; scale, scale, scale. ClouDiA: an initial step in resource optimization in public clouds. Next steps: Collaborate with us to build a DCMS! Tons of research challenges are open. We are already collaborating with the Danish Geodata Agency (GST). We are looking for partners :-) Thank you!