Sherpa: Cloud Computing of the Third Kind
  Raghu Ramakrishnan, Yahoo!
  and Platform Engineering Team
What's in a Name?
  - Data Intensive Super Scalable Computing
  - Grid Computing
  - Super Computing
  - Cloud Computing
  - Parallel Database Management Systems
  - Distributed Database Management Systems
  Vary across: workload, programming model, ownership model, architectural trade-offs
Cloud Computing: Computing as a Service
  - Packaged Software
  - Cloud Computing
  - CPU Intensive
  - Data Intensive
  - High-throughput (e.g., Condor)
  - Transactional Storage & Serving (e.g., PNUTS, S3, SSDS, UDB)
  - Analytic (e.g., SSDS, Hadoop)
Trivia Question
  What's the world's most widely used parallel programming language?
Why Not Use an RDBMS for Analytics?
  RDBMS provides too much
    - ACID transactions
    - Complex query language
    - Lots and lots of knobs to turn
  RDBMS provides too little
    - Lots of optimization and tuning possible for analytics (e.g., column stores, bit-map indexes)
    - Flexible programming model (e.g., Group By vs. Map-Reduce; multi-dimensional OLAP)
  But many good ideas to borrow!
    - Declarative language; parallelization and optimization techniques; value of data consistency
Why Not Use an RDBMS for OLTP?
  RDBMS provides too much
    - ACID transactions
    - Complex query language
    - Lots and lots of knobs to turn
  RDBMS provides too little
    - Lack of (cost-effective) scalability, availability
    - Not enough schema/data type flexibility
  RDBMS and Sherpa aim for different parts of the space
    - RDBMS: heavyweight, strongly consistent OLTP
    - Sherpa: lightweight but massive scale, relaxed consistency OLTP
I want a big, virtual database
  "What I want is a robust, high performance virtual relational database that runs transparently over a cluster, nodes dropping in and out of service at will, read-write replication and data migration all done automatically. I want to be able to install a database on a server cloud and use it like it was all running on one machine."
  -- Greg Linden's blog, 2006
  We're building a hosted version of such a system
An Example Web App
  Heavy use of simple database operations
  Updates
    - Uploads
    - Tags (e.g., tagging a photo as "flower")
  Queries
    - Your Photos
    - Photos tagged as "flower"
    - Friend activity (e.g., "Sonja uploaded", "Brandon tagged a photo")
Why Hosted?
  - Simple API
  - Rapid application development
  - On-demand scaling
  - DBA functions amortized across applications
Rapid Application Development
  What does it take to get the Next Great Thing off the ground?
  Now:
    - Set up multiple replicas of a clustered data store
    - Set up a system for indexing
    - Set up a system for caching
    - Set up auxiliary DBMS instances for reporting, etc.
    - Set up the feeds and messaging between them
    - Write the application logic
    Fairly complex system at first line of new code
  Our vision:
    - Write the application logic
    - Use a hosted infrastructure to store and query your data
  Or, as Joshua Schachter puts it: "The next cool thing shouldn't take a team of 30, it should be three guys, PHP and a long weekend"
Implications
  Data management as a service
    - Scientists and others who've resisted (installing, maintaining, and) using DBMSs will find it much easier to reap the benefits
    - Data centers and Computing Centers will come into vogue again
  The Web is becoming open
    - E.g., OpenSocial, OpenID
  Hosted back-ends and RAD tools will make Web application development accessible to all
    - Ideas will be the most valuable currency, not the wherewithal to build complex systems
  Paradigm shifts possible for how we do research in many fields:
    - Build applications that embed your algorithms and test them directly in the field
    - Computer Scientists can interact directly with users (ironically, this would still be a breakthrough of sorts after four decades!)
    - Many other disciplines (e.g., sociology, microeconomics) can design and conduct online experiments involving unprecedented numbers of participants
PNUTS: DB in the Cloud
  CREATE TABLE Parts (ID VARCHAR, StockNumber INT, Status VARCHAR)
  Sample records (the figure shows the same table in three replicas):
    A  42342  E
    B  42521  W
    C  66354  W
    D  12352  E
    E  75656  C
    F  15677  E
  - Parallel database
  - Geographic replication
  - Indexes and views
  - Structured, flexible schema
  - Hosted, managed infrastructure
Sherpa Data Services
  Applications
  PNUTS Services
    - Query planning and execution
    - Index maintenance
  YCA: authorization
  Distributed infrastructure for tabular data
    - Data partitioning
    - Update consistency
    - Replication
  YDOT FS: ordered tables
  YDHT FS: hash tables
  YMB: pub/sub messaging
  Zookeeper: consistency service
Guiding Principles for PNUTS
  Reliable and robust storage
    - Replication for fault tolerance
    - Predictable consistency guarantees
  Simple to use
    - Simple operations set
    - Minimal client configuration
    - Service-level authentication
    - Flexible schemas
  Highly Scalable / Performant
    - Partitioning data over many machines
    - Horizontal scaling at every level
    - Data is local to its usage
    - Predictable performance via quality of service levels
    - Predicates evaluated on back end
    - Cheaper consistency guarantees than full ACID
  Multiple rich access methods
    - Hash and ordered table types
    - System-maintained secondary indexes
    - Optimization for complex access patterns
  Rapid provisioning of new storage
    - Simple, automated cluster growth
    - Cheap table creation
    - Pay as you grow, grow big as you need
  Operationally cheap
    - Automated failover
    - Automated load balancing
    - No single points of failure
    - Hosted platform
Data Model and Retrieval
  YDOT/YDHT
    Data model: key-value dictionary
      - Value can be packed with multiple attributes
    YDHT operations: hash table calls
      - Get
      - Set (insert and update)
      - Remove
      - Scan
    YDOT: YDHT + ordered ranges
  PNUTS
    Data model: relational tables with flexible schema
      - Typed, declared attributes
      - Fast addition of new attributes
    Operations: PNUTS query language
      - Point lookup
      - Range queries
      - Insert/Update/Remove
      - Complex predicates
      - Ordering
      - Top-K
  Primary API is web services (JSON over HTTP)
  Client libraries for various languages (PHP, C++, Java, ...)
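The four YDHT hash table calls listed above can be illustrated with a minimal in-memory sketch. The class name `ToyHashTable` and the method signatures are hypothetical stand-ins chosen for illustration; the slides name the operations but do not give the real YDHT client API.

```python
from typing import Dict, Iterator, Optional, Tuple


class ToyHashTable:
    """Hypothetical in-memory stand-in for the Get/Set/Remove/Scan call set."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def get(self, key: str) -> Optional[dict]:
        # Point lookup by primary key; None if the record is absent.
        return self._records.get(key)

    def set(self, key: str, value: dict) -> None:
        # Set covers both insert and update, as on the slide.
        self._records[key] = dict(value)

    def remove(self, key: str) -> bool:
        # Returns True if a record was actually removed.
        return self._records.pop(key, None) is not None

    def scan(self) -> Iterator[Tuple[str, dict]]:
        # Unordered scan over all records (a hash table has no key order).
        yield from self._records.items()


table = ToyHashTable()
# The value is packed with multiple attributes.
table.set("Apple", {"description": "Apple is wisdom", "price": 1})
table.set("Kiwi", {"description": "New Zealand", "price": 8})
```

An ordered store in the style of YDOT would add range retrieval on top of these calls; this sketch deliberately leaves that out.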
YDHT
  - Scalable distributed record store
  - Optimized for small reads and writes
  - Focus on ease of operations, multi-region redundancy, organic scalability
  - Storage as a service
  Components shown: Clients, Tablet Controller, Routers, Storage servers
Ways to use YDHT
  - As a primary store (components shown: APP, YDHT)
  - As a materialized view/cache (components shown: APP, YDHT, Primary store)
  - As part of PNUTS! (components shown: APP, PNUTS, YDHT)
Data Concepts: YDHT
  Labeled in the figure: table, primary key, record, tablet, fields
  Primary key   Field
  Grape         Grapes are good to eat
  Lime          Limes are green
  Apple         Apple is wisdom
  Strawberry    Strawberry shortcake
  Orange        Arrgh! Don't get scurvy!
  Avocado       But at what price?
  Lemon         How much did you pay for this lemon?
  Tomato        Is this a vegetable?
  Banana        The perfect fruit
  Kiwi          New Zealand
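A hash table store groups records into tablets by hashing the primary key. The sketch below shows one way that grouping could work. The choice of MD5, the modulo scheme, and the names `tablet_for` and `group_by_tablet` are assumptions for illustration; the slides do not say which hash function or tablet-assignment scheme YDHT uses.

```python
import hashlib
from typing import Dict, List

KEYS = ["Grape", "Lime", "Apple", "Strawberry", "Orange",
        "Avocado", "Lemon", "Tomato", "Banana", "Kiwi"]


def tablet_for(key: str, num_tablets: int) -> int:
    """Map a primary key to a tablet index (assumed scheme: MD5 mod N)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_tablets


def group_by_tablet(keys: List[str], num_tablets: int) -> Dict[int, List[str]]:
    """Place every key in exactly one tablet."""
    tablets: Dict[int, List[str]] = {i: [] for i in range(num_tablets)}
    for key in keys:
        tablets[tablet_for(key, num_tablets)].append(key)
    return tablets


tablets = group_by_tablet(KEYS, 3)
```

Because placement depends only on the hash, neighboring keys such as Lemon and Lime may land in different tablets, which is why the hashed layout has no meaningful key order.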
Data Concepts: YDOT
  - Ordered by primary key
  - Tablets contain clustered ranges
  Primary key   Field
  Apple         Apple is wisdom
  Avocado       But at what price?
  Banana        The perfect fruit
  Grape         Grapes are good to eat
  Kiwi          New Zealand
  Lemon         How much did you pay for this lemon?
  Lime          Limes are green
  Orange        Arrgh! Don't get scurvy!
  Strawberry    Strawberry shortcake
  Tomato        Is this a vegetable?
YDOT Ordered Table Store
  YDOT provides clustered, ordered retrieval of records
  Router interval map (boundary keys and the storage unit for each range):
    below Cantaloupe            Storage unit 1
    Cantaloupe to Lime          Storage unit 3
    Lime to Strawberry          Storage unit 2
    Strawberry and above        Storage unit 1
  Storage unit contents:
    Storage unit 1: Apple, Avocado, Banana, Blueberry; Strawberry, Tomato, Watermelon
    Storage unit 2: Lime, Mango, Orange
    Storage unit 3: Cantaloupe, Grape, Kiwi, Lemon
  Example range query "Grapefruit to Pear?": the router splits it into "Grapefruit to Lime?" and "Lime to Pear?"
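The router's interval map can be sketched with a sorted list of boundary keys and a binary search. The boundaries and storage unit numbers come from the slide; the function names `route` and `split_range`, and the half-open range convention, are assumptions made for this illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

# Boundary keys between tablets, in sorted order (from the slide).
BOUNDARIES = ["Cantaloupe", "Lime", "Strawberry"]
# UNITS[i] serves keys in [BOUNDARIES[i-1], BOUNDARIES[i]); one more entry than boundaries.
UNITS = [1, 3, 2, 1]


def route(key: str) -> int:
    """Return the storage unit holding the tablet that covers this key."""
    return UNITS[bisect_right(BOUNDARIES, key)]


def split_range(lo: str, hi: str) -> List[Tuple[str, str, int]]:
    """Split the half-open range [lo, hi) at tablet boundaries."""
    parts = []
    start = lo
    i = bisect_right(BOUNDARIES, lo)
    # Cut the range at every boundary that falls strictly inside it.
    while i < len(BOUNDARIES) and BOUNDARIES[i] < hi:
        parts.append((start, BOUNDARIES[i], UNITS[i]))
        start = BOUNDARIES[i]
        i += 1
    parts.append((start, hi, UNITS[i]))
    return parts


sub_queries = split_range("Grapefruit", "Pear")
# [('Grapefruit', 'Lime', 3), ('Lime', 'Pear', 2)]
```

Each sub-range can then be sent to its own storage unit and the results concatenated in key order, since the tablets hold clustered ranges.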
Data Concepts: PNUTS
  - Schema: declared, typed fields
  - Retains tablet structure of YDHT/YDOT
  Name          Description                             Price
  Apple         Apple is wisdom                         $1
  Avocado       But at what price?                      $3
  Banana        The perfect fruit                       $2
  Grape         Grapes are good to eat                  $12
  Kiwi          New Zealand                             $8
  Lemon         How much did you pay for this lemon?    $1
  Lime          Limes are green                         $9
  Orange        Arrgh! Don't get scurvy!                $2
  Strawberry    Strawberry shortcake                    $900
  Tomato        Is this a vegetable?                    $14
Flexible Schema
  Primary table
  Posted date   Listing id   Item    Price
  6/1/07        424252       Couch   $570
  6/1/07        763245       Bike    $86
  6/3/07        211242       Car     $1123
  6/5/07        421133       Lamp    $15
  Additional attributes present on only some records: Color (Red), Condition (Good, Fair)
Asynchronous Replication
Mastering
  The figure shows the same table in three replicas; one is labeled "Tablet master"
    A  42342  E
    B  42521  W
    C  66354  W
    D  12352  E
    E  75656  C
    F  15677  E
Basic Consistency Model
  Goal: Make it easier for applications to reason about updates and cope with asynchrony
    - An alternative to transactions in an asynchronous world
  What happens to a record with primary key "Brian"? (timeline, left to right)
    Generation 1: record inserted (v. 1), update (v. 2), update (v. 3), delete
    Generation 2: record inserted (v. 1), update (v. 2), update (v. 3), update (v. 4), delete
    Generation 3: record inserted (v. 1), delete
  Guarantees:
    - Every reader will always see some consistent, but possibly stale, version
    - Readers can request a more up-to-date version, but may pay extra latency
    - Special case: critical read (writer/readers see their own writes)
    - Writers can verify that the record is still at the version they expect
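The guarantees above can be made concrete with a toy model: one master copy that orders all writes to a record, and one lagging replica. The class `ToyStore` and its method names are hypothetical and are not the PNUTS API; the sketch also ignores generations (delete followed by re-insert) and real network propagation.

```python
from typing import Dict, Optional, Tuple

Versioned = Tuple[int, str]  # (version number, value)


class ToyStore:
    """Hypothetical single-record-timeline model: a master and a stale replica."""

    def __init__(self) -> None:
        self.master: Dict[str, Versioned] = {}
        self.replica: Dict[str, Versioned] = {}

    def write(self, key: str, value: str) -> int:
        # All writes go through the master, so versions form one sequence.
        version = self.master.get(key, (0, ""))[0] + 1
        self.master[key] = (version, value)
        return version

    def test_and_set_write(self, key: str, expected: int, value: str) -> Optional[int]:
        # Writer verifies the record is still at the version it expects.
        if self.master.get(key, (0, ""))[0] != expected:
            return None
        return self.write(key, value)

    def propagate(self) -> None:
        # Stand-in for asynchronous replication catching up.
        self.replica = dict(self.master)

    def read_any(self, key: str) -> Optional[Versioned]:
        # Cheap read: consistent but possibly stale.
        return self.replica.get(key)

    def read_latest(self, key: str) -> Optional[Versioned]:
        # More up-to-date read: modeled here as a trip to the master.
        return self.master.get(key)


store = ToyStore()
store.write("Brian", "first")
store.propagate()
store.write("Brian", "second")  # the replica has not seen this yet
```

In this model `read_any` never returns a mix of versions, only an older one, which is the sense in which every reader sees a consistent but possibly stale record.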
Distribution
  - Distribution for parallelism
  - Data shuffling for load balancing
  Records spread across Server 1, Server 2, Server 3, and Server 4:
    6/1/07    424252   Couch     $570
    6/1/07    256623   Car       $1123
    6/2/07    636353   Bike      $86
    6/5/07    662113   Chair     $10
    6/7/07    121113   Lamp      $19
    6/9/07    887734   Bike      $56
    6/11/07   252111   Scooter   $18
    6/11/07   116458   Hammer    $8000
Tablet Splitting and Balancing
  - Each storage unit has many tablets
  - Tablets may grow over time
  - Overfull tablets split
  - Storage unit may become a hotspot
  - Shed load by moving tablets to other servers
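The two mechanisms on this slide, splitting an overfull tablet and shedding load by moving a tablet, can be sketched as follows. The split-at-median rule, the record-count load measure, and the names `split_tablet` and `shed_load` are assumptions for illustration; the slides do not describe the actual policies.

```python
from typing import Dict, List, Optional, Tuple

Tablet = List[str]  # a tablet is modeled as a list of primary keys


def split_tablet(keys: Tablet, max_records: int) -> List[Tablet]:
    """Split a tablet in two at the median key if it is overfull."""
    ordered = sorted(keys)
    if len(ordered) <= max_records:
        return [ordered]
    mid = len(ordered) // 2
    return [ordered[:mid], ordered[mid:]]


def shed_load(servers: Dict[str, List[Tablet]]) -> Optional[Tuple[str, str]]:
    """Move one tablet from the most loaded server to the least loaded one."""
    def load(name: str) -> int:
        return sum(len(tablet) for tablet in servers[name])

    busiest = max(servers, key=load)
    idlest = min(servers, key=load)
    if busiest == idlest or not servers[busiest]:
        return None  # nothing to move
    servers[idlest].append(servers[busiest].pop())
    return busiest, idlest


halves = split_tablet(["f", "a", "d", "b", "e", "c"], max_records=4)
cluster = {"su1": [["a", "b"], ["c", "d"]], "su2": []}
moved = shed_load(cluster)
```

A production balancer would weigh request rates as well as tablet sizes; record counts are used here only to keep the example small.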
Architecture
  Data-path components: each can be scaled horizontally
  Labels in the figure:
    - Clients
    - WS API
    - Routers
    - Tablet controller
    - Tablet map
    - Load balancer
    - Server monitor
    - Query processing
    - YMB
    - SU API
    - Storage units
    - Cluster 1, Cluster 2
Yahoo! Message Broker (YMB)
  Pub/sub based on reliable logging
    - Topic-based
    - Persistent subscriptions
    - Multi-region presence
  Guarantees
    In the presence of at most one YMB machine failure:
      - Published messages will be delivered on live subscriptions system-wide
      - Messages published in one region will be delivered to all subscribers in the order they were published (partial order)
      - Published messages remain available for re-delivery until the subscriber calls consume()
    If there are two machine failures:
      - Subscribers will be notified of a broken subscription, since messages may have been lost
  Uses in YDHT/PNUTS
    - Reliably replicate data and updates between regions
    - Reliably communicate coordination/synchronization messages between distributed actors
    - Reliably log to-do actions for individual actors
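The re-delivery guarantee can be illustrated with a toy single-topic log in which a message stays pending for a subscriber until that subscriber consumes it. Only the call name `consume()` comes from the slide; the class `ToyTopic`, its other methods, and their signatures are hypothetical, and the sketch models neither machine failures nor multiple regions.

```python
from typing import Dict, List, Tuple


class ToyTopic:
    """Hypothetical in-memory topic: an append-only log plus per-subscriber pending lists."""

    def __init__(self) -> None:
        self._log: List[str] = []
        self._pending: Dict[str, List[int]] = {}

    def subscribe(self, subscriber: str) -> None:
        # A new subscription starts with nothing pending.
        self._pending[subscriber] = []

    def publish(self, message: str) -> int:
        # Append to the log and mark the message pending for every subscriber.
        seq = len(self._log)
        self._log.append(message)
        for pending in self._pending.values():
            pending.append(seq)
        return seq

    def deliver(self, subscriber: str) -> List[Tuple[int, str]]:
        # Returns unconsumed messages in publish order; repeat calls re-deliver them.
        return [(seq, self._log[seq]) for seq in self._pending[subscriber]]

    def consume(self, subscriber: str, seq: int) -> None:
        # After consume(), the message is no longer re-delivered to this subscriber.
        self._pending[subscriber].remove(seq)


topic = ToyTopic()
topic.subscribe("west")
first = topic.publish("update A")
second = topic.publish("update B")
```

Calling `topic.deliver("west")` twice returns both messages both times; only `consume` removes one, which is what lets a replica crash mid-apply and still pick the update up again later.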
Quality of Service
  Hosted platform supporting multiple applications
    - And eventually, multi-tenancy!
  Inter-application isolation
    - Applications run on leased servers
    - Performance is as good as those servers give you
    - Unaffected by other applications
    - Some shared infrastructure, overprovisioned to ensure performance agreements
  Intra-application isolation
    - How to share my data without hurting my app's performance?
    - Gold versus best-effort access
    - Best-effort may be interrupted to serve gold requests
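Gold versus best-effort access can be approximated with a two-level priority queue in which gold requests are always served first. This is a simplification: it orders waiting requests but does not interrupt a best-effort request that is already running, as the slide describes. The names `ToyScheduler`, `GOLD`, and `BEST_EFFORT` are hypothetical.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

GOLD = 0         # lower number is served first
BEST_EFFORT = 1


class ToyScheduler:
    """Hypothetical request queue: gold before best-effort, FIFO within a level."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, request: str, level: int) -> None:
        heapq.heappush(self._heap, (level, next(self._arrival), request))

    def next_request(self) -> Optional[str]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.submit("bulk scan", BEST_EFFORT)
scheduler.submit("user lookup", GOLD)
scheduler.submit("report export", BEST_EFFORT)
served = [scheduler.next_request() for _ in range(3)]
# ['user lookup', 'bulk scan', 'report export']
```

The arrival counter keeps requests at the same level in submission order, so best-effort work is delayed by gold traffic but not reordered among itself.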
BigTable
  BigTable overview
    - Rows and columns abstraction with flexible schemas and data versioning, range scans
    - Built on top of GFS
  Things BigTable emphasizes that we don't (for now, anyway)
    - Keeping multiple versions
    - Tight integration with MapReduce
  Things we emphasize that BigTable doesn't
    - Asynchrony
    - Geographic replication
    - Indexing
Dynamo
  Dynamo overview
    - Highly write-available data store
    - Uses gossip and eventual consistency: can write anywhere, and eventually the update will propagate to all replicas
  PNUTS versus Dynamo
    - Dynamo is a hash table; PNUTS is both hashed and ordered
    - Eventual consistency model exposes dirty data
    - PNUTS can operate in high availability or high consistency mode
    - Gossip is not tuned for geographic replication
    - No record structure or indexes in Dynamo
Summary
  Hosted data management is a new frontier
    - Beyond the issues we discussed, many novel aspects arise because of hosting (e.g., multi-tenancy)
    - Paradigm shift that goes beyond the technology (e.g., new kinds of usage, new business models)
  Formulas for new research problems:
    - Old research problem + fine-grained asynchrony
    - Old research problem + hosted service model
  Formulas for solutions?
    - None so far, but lots of good ideas in the old solutions!