Cloud Data Yahoo!

Transcription

1 Cloud Data Yahoo! Raghu Ramakrishnan Chief Scientist, Audience and Cloud Computing Brian Cooper Adam Silberstein Yahoo! Research Joint work with the Sherpa team in Cloud Computing 1

2 Cloud Use Cases Content Optimization Machine Learning (e.g. Spam filters) Search Index Ads Optimization Attachment Storage Image/Video Storage & Delivery 2

3 Yahoo! Cloud Stack EDGE YCS Horizontal YCPI Cloud Services Brooklyn Provisioning (Self-serve) Monitoring/Metering/Security VM/OS VM/OS WEB Horizontal yapache Cloud ServicesPHP APP Horizontal Serving Cloud Grid Services OPERATIONAL STORAGE PNUTS/Sherpa Horizontal MOBStor Cloud Services App Engine Data Highway Hadoop BATCH STORAGE Horizontal Cloud Services 3

4 Requirements for Cloud Services Multitenant. A cloud service must support multiple, organizationally distant customers. Elasticity. Tenants should be able to negotiate and receive resources/qos on-demand up to a large scale. Resource Sharing. Ideally, spare cloud resources should be transparently applied when a tenant s negotiated QoS is insufficient, e.g., due to spikes. Horizontal scaling. The cloud provider should be able to add cloud capacity in increments without affecting tenants of the service. Metering. A cloud service must support accounting that reasonably ascribes operational and capital expenditures to each of the tenants of the service. Security. A cloud service should be secure in that tenants are not made vulnerable because of loopholes in the cloud. Availability. A cloud service should be highly available. Operability. A cloud service should be easy to operate, with few operators. Operating costs should scale linearly or better with the capacity of the service. Will discuss today 4

5 Web Data Management Warehousing, ML / data mining Scan oriented workloads Focus on sequential disk I/O $ per cpu cycle Large data analysis (Hadoop/Pig/Zebra) Blob storage (MObStor) Structured record storage (PNUTS/Sherpa) Object retrieval and streaming Scalable file storage $ per GB storage & bandwidth CRUD Point lookups and short scans Index organized table and random I/Os $ per latency 6

6 ACID or BASE? Litmus tests are colorful, but the picture is cloudy SCALABLE DATA SERVING 8

7 Yahoo! Serving Storage Problem Small records 100KB or less Structured records Lots of fields, evolving Extreme data scale - Tens of TB Extreme request scale - Tens of thousands of requests/sec Low latency globally datacenters worldwide High Availability - Outages cost $millions Variable usage patterns - Applications and users change 9 9

8 Typical Y! Applications User logins and profiles Including changes that must not be lost! Events But single-record transactions suffice Alerts (e.g., news, price drops) Social network activity (e.g., user goes offline) Ad clicks, article clicks Application-specific data Postings in message board Uploaded photos, tags Shopping carts 500M+ unique users per month Hundreds of petabytes of storage Hundreds of billions of objects Hundred of thousands of requests/sec Global, rapidly evolving workloads 10

9 VLSD Data Serving Stores Must partition data across machines How are partitions determined? Can partitions be changed easily? Affects elasticity, resource sharing, operability How are read/update requests routed? Range selections? Can requests span machines? Availability: What failures are handled? With what semantic guarantees on data access? (How) Is data replicated? Sync or async? Consistency model? Local or geo? How are updates made durable? How is data stored on a single machine? 11

10 PNUTS/SHERPA 12 12

11 The World Has Changed Web serving applications need: Scalability! Elastic on demand, commodity boxes Flexible schemas Geographic distribution/replication High availability Low latency Web serving applications willing to do without: Complex queries ACID transactions But still benefit from support for data consistency 13

12 What is PNUTS/Sherpa? A E B W C W D E E C F E CREATE TABLE Parts ( ID VARCHAR, StockNumberINT, Status VARCHAR ) Structured, flexible schema A E B W C W D E E C F E Parallel database A E B W C W D E E C F E Geographic replication Hosted, managed infrastructure 14 14

13 Architecture Local region Remote regions Clients REST API Routers Tablet Controller Tribble Storage units 15 15

14 PNUTS: Key Components Caches the maps from the TC Routes client requests to correct SU Maintains map from database.table.key-totablet-to-su Provides load balancing Stores records in tablets Services get/set/delete requests 16 16

15 Tablets Hash Table 0x0000 Name Description Price Grape Grapes are good to eat $12 Lime Limes are green $9 Apple Apple is wisdom $1 0x2AF3 Strawberry Orange Strawberry shortcake Arrgh! Don t get scurvy! $900 $2 Avocado But at what price? $3 Lemon How much did you pay for this lemon? $1 0x911F Tomato Banana Is this a vegetable? The perfect fruit $14 $2 0xFFFF Kiwi New Zealand $

16 Tablets Ordered Table A Name Description Price Apple Apple is wisdom $1 Avocado But at what price? $3 Banana The perfect fruit $2 H Grape Kiwi Grapes are good to eat New Zealand $12 $8 Lemon How much did you pay for this lemon? $1 Lime Limes are green $9 Q Orange Strawberry Arrgh! Don t get scurvy! Strawberry shortcake $2 $900 Z Tomato Is this a vegetable? $

17 Flexible Schema Posted date Listing id Item Price 6/1/ Couch $570 6/1/ Bike $86 6/3/ Car $1123 6/5/ Lamp $15 Color Red Condition Good Fair 19

18 PROCESSING READS & UPDATES 20 20

19 Updates 8 Sequence # for key k 1 Write key k Routers Message brokers 3 Write key k 7 Sequence # for key k 2 Write key k 4 SU SU SU 5 SUCCESS 6 Write key k 21 21

20 Accessing Data 4 Record for key k 1 Get key k 3 Record for key k 2 Get key k SU SU SU 22 22

21 Range Queries in YDOT Clustered, ordered retrieval of records Apple Avocado Banana Blueberry Grapefruit Pear? Canteloupe Grape Kiwi Lemon Lime Mango Orange Storage unit 1 Canteloupe Storage unit 3 Lime Storage unit 2 Strawberry Storage unit 1 Router Lime Pear? Grapefruit Lime? Apple Strawberry Avocado Tomato Banana Watermelon Blueberry Strawberry Tomato Watermelon Lime Mango Orange Canteloupe Grape Kiwi Lemon Storage unit 1 Storage unit 2 Storage unit 3 23

22 Bulk Load in YDOT YDOT bulk inserts can cause performance hotspots Solution: preallocate tablets 24

23 ELASTICITY, OPERABILITY, HORIZONTAL SCALING 25 25

24 Distribution 6/1/ /1/ Couch $570 Data Distribution shuffling for load parallelism balancing Car $1123 6/2/ Bike $86 6/5/ Chair $10 6/7/ Lamp $19 6/9/ Bike $56 6/11/ Scooter $18 6/11/ Hammer $8000 Server 1 Server 2 Server 3 Server

25 Tablet Splitting and Balancing Each storage unit has many tablets (horizontal partitions of the table) Storage unit may become a hotspot Storage unit Tablet Overfull tablets split Tablets may grow over time Shed load by moving tablets to other servers 27 27

26 ASYNCHRONOUS REPLICATION AND CONSISTENCY 28 28

27 Asynchronous Replication 29 29

28 Consistency Levels ACID consistency Transaction: Add Brian as Toby s friend, and Toby as Brian s friend Brian s Friends Brad PPSN Adam Utkarsh Toby s Friends Carol Ari Shelton Jay Brian s Friends Brad PPSN Adam Utkarsh Toby Toby s Friends Carol Ari Shelton Jay Brian States you will never see: Brian is Toby s friend but Toby is not Brian s friend Toby is Brian s friend but Brian is not Toby s friend 30

29 The CAP Theorem You have to give up one of the following in a distributed system (Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002): Consistency of data Think serializability Availability Pinging a live node should produce results Partition tolerance Live nodes should not be blocked by partitions 31

30 BASE Approaches to CAP No ACID, use a single version of DB, reconcile later Defer transaction commit Until partitions fixed and distr xact can run Eventual consistency (e.g., Amazon Dynamo) Eventually, all copies of an object converge Restrict transactions (e.g., Sharded MySQL) 1-M/c Xacts: Objects in xact are on the same machine 1-Object Xacts: Xact can only read/write 1 object Object timelines (PNUTS) 32

31 Consistency: Social Alice West East Record Timeline User Alice Status Busy Network disruption: Alice redirected to East Busy User Status User Alice Status Free Free Alice Busy busy free User Status Alice??? User Status Alice??? Free 33

32 PNUTS Consistency Model Goal: Make it easier for applications to reason about updates and cope with asynchrony What happens to a record with primary key Alice? Record inserted Update Update Update Update Update Update Update Delete v. 1 v. 2 v. 3 v. 4 v. 5 v. 6 v. 7 v. 8 Generation 1 Time As the record is updated, copies may get out of sync

33 PNUTS Consistency Model Read Stale version Stale version Current version v. 1 v. 2 v. 3 v. 4 v. 5 v. 6 v. 7 v. 8 Generation 1 Time In general, reads are served using a local copy 35 35

34 PNUTS Consistency Model Read up-to-date Stale version Stale version Current version v. 1 v. 2 v. 3 v. 4 v. 5 v. 6 v. 7 v. 8 Generation 1 Time But application can request and get current version 36 36

35 PNUTS Consistency Model Read v.6 Stale version Stale version Current version v. 1 v. 2 v. 3 v. 4 v. 5 v. 6 v. 7 v. 8 Generation 1 Time Or variations such as read forward while copies may lag the master record, every copy goes through the same sequence of changes 37 37

36 PNUTS Consistency Model Write Stale version Stale version Current version v. 1 v. 2 v. 3 v. 4 v. 5 v. 6 v. 7 v. 8 Generation 1 Time Achieved via per-record primary copy protocol (To maximize availability, record masterships automatically transferred if site fails) Can be selectively weakened to eventual consistency (local writes that are reconciled using version vectors) 38 38

37 PNUTS Consistency Model Write if = v.7 ERROR Stale version Stale version Current version v. 1 v. 2 v. 3 v. 4 v. 5 v. 6 v. 7 v. 8 Generation 1 Time Test-and-set writes facilitate per-record transactions 39 39

38 Consistency Techniques Per-record mastering Each record is assigned a master region May differ between records Updates to the record forwarded to the master region Ensures consistent ordering of updates Tablet-level mastering Each tablet is assigned a master region Inserts and deletes of records forwarded to the master region Master region decides tablet splits These details are hidden from the application Except for the latency impact! 43

39 Record Master A E B WE C W D E E C F E A E B W C W D E E C F E A E B WE C W D E E C F E 44 44

40 Tablet Master Key2: Region W Key E Key W Key E Key C Key E Region C Tablet master Key E Key W Key E Key C Key E Region E Key E Key W Key E Key C Key E 45 45

41 Tablet Mastership Tablet master Region W Region C Region E Key E Key E Key E Key W Key E Step 1: Forward Req to Tablet Master Key W Key E Key W Key E Key C Key C Key C Key E Step 2: Apply Insert to Tablet Master Key E Key E Key E Key W Key W Key E Key W Key W Key E Step 4: Apply Insert at Rec Master Key E Key C Key E Step 3: Replicate Insert to Other Sites Key E Key W Key W Key E Key C Key C Key E Key E 46 46

42 Write Interaction Consistency on (default) Insert: primary keys guaranteed to be unique Across all data replicas Update: updates are applied to a record in a given order Same order for all replicas Delete: deleted records will not re-appear, and records will not disappear without a delete Consistency off Above guarantees may not be enforced This decision is at the table level Once you do some consistency off stuff, cannot guarantee any consistency 47 47

43 Read Interaction Asynchronous replication means some data will be (temporarily) out of date All replicas of a record experience the same history of updates No guarantees offered for cross-record or whole-database consistency Read levels (assuming consistency on) decided per call: Read any Give me any available version of the record from that record s history Critical read Give me any version of the record, but no older than my last update Read forward Give me any version of the record, but no older than my last read Up to date Give me the most recent version of the record Though new updates may occur while the result is in transit back to me 48 48

44 Summary: Consistency Sherpa offers timeline and eventual consistency Can get per-record ACID using test-and-set Other systems offer other choices ACID Per-record ACID Eventual Other Oracle Weaker serializable levels Weaker serializable levels Supports quorum read/writes HBase Best effort Vertica HDFS Does not allow updates UDB Best-effort only 49

45 AVAILABILITY 50 50

46 Coping With Failures X X A E B W C W D E E C OVERRIDE W E F E A E B W C W D E E C F E A E B W C W D E E C F E 51 51

47 Possible Failure Modes Failure type Storage unit Consistency impact None Availability impact Degraded service (forwards) for some data. Updates and inserts fail for some records Resolution If data not lost: Reboot machine If data lost: Copy lost tablets from a remote replica Issue overrides to restore availability Time to resolve If data lost, hours or less (depending on tablet size and colo location). If no data lost, minutes. X Storage units 52

48 Possible Failure Modes Failure type Router Consistency impact None Availability impact None Resolution Boot router Time to resolve Minutes X Routers 53

49 Possible Failure Modes Failure Tablet controller Consistency impact XTablet controller None Availability impact Some actions (e.g., tablet copy) will be blocked Resolution Start secondary controller Time to resolve Minutes 54

50 Possible Failure Modes Failure One msg hub node Consistency impact X Msg Hubs None Availability impact Writes fail for some records until a new secondary node takes over Resolution Create new primary or secondary for lost topics Time to resolve Minutes 55

51 Possible Failure Modes Failure Colo power outage or partition Consistency impact Option to allow relaxed consistency to improve availability Availability impact Some inserts, updates and deletes cannot succeed Some critical reads fail Option to allow updates to proceed in relaxed consistency mode Resolution Major overrides to force mastership transfer; discard conflicting updates Time to resolve Hours Tablet map Load balancer Server monitor Tablet controller SU API Clients Routers WS API Tribble Storage units 56

52 Further PNutty Reading Efficient Bulk Insertion into a Distributed Ordered Table (SIGMOD 2008) Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan PNUTS: Yahoo!'s Hosted Data Serving Platform (VLDB 2008) Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni Asynchronous View Maintenance for VLSD Databases (SIGMOD 2009) Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava and Raghu Ramakrishnan Cloud Storage Design in a PNUTShell Brian F. Cooper, Raghu Ramakrishnan, and Utkarsh Srivastava Beautiful Data, O Reilly Media, 2009 Adaptively Parallelizing Distributed Range Queries (VLDB 2009) Ymir Vigfusson, Adam Silberstein, Brian Cooper, Rodrigo Fonseca 57

53 Green Apples and Red Apples COMPARING SOME CLOUD SERVING STORES 58

54 Help! I have a huge amount of data. What should I do with it? UDB Sherpa Dryad DB2 59

55 Serving Stores Many cloud DB and nosql systems out there PNUTS BigTable HBase, Hypertable, HTable Azure Cassandra Megastore Amazon Web Services S3, SimpleDB, EBS And more: CouchDB, MongoDB, Voldemort, etc. How do they compare? Feature tradeoffs Performance tradeoffs Not clear! 60

56 The Contestants Baseline: Sharded MySQL Horizontally partition data among MySQL servers PNUTS/Sherpa Yahoo! s cloud database Cassandra BigTable + Dynamo HBase BigTable + Hadoop 61

57 SHARDED MYSQL 62 62

58 Architecture Our own implementation of sharding Client Client Client Client Client Shard Server Shard Server Shard Server Shard Server MySQL MySQL MySQL MySQL 63

59 Pros Simple Infinitely scalable Low latency Geo-replication Pros and Cons Cons Not elastic (Resharding is hard) Poor support for load balancing Failover? (Adds complexity) Replication unreliable (Async log shipping) 64

60 Azure SDS Cloud of SQL Server instances App partitions data into instance-sized pieces Transactions and queries within an instance Data SDS Instance Storage Per-field indexing 65

61 Google MegaStore Transactions across entity groups Entity-group: hierarchically linked records Ramakris Ramakris.preferences Ramakris.posts Ramakris.posts.aug Can transactionally update multiple records within an entity group Records may be on different servers Use Paxos to ensure ACID, deal with server failures Can join records within an entity group Other details Built on top of BigTable Supports schemas, column partitioning, some indexing Phil Bernstein, 66

62 PNUTS 67 67

63 Pros and Cons Pros Reliable geo-replication Scalable consistency model Elastic scaling Easy load balancing Cons System complexity relative to sharded MySQL to support geo-replication, consistency, etc. Latency added by router 68

64 HBASE 69 69

65 Architecture Client Client Client Client Client REST API Java Client HBaseMaster HRegionServer HRegionServer HRegionServer HRegionServer Disk Disk Disk Disk 70

66 HRegion Server Records partitioned into MemStores Each MemStore contains many HFiles All writes to MemStore applied to single memcache Reads consult HFiles and memcache Memcaches flushed as HFiles (HDFS files) when full Compactions limit number of HFiles writes HRegionServer Memcache reads MemStore HFiles Flush to disk 71

67 Pros and Cons Pros High write throughput Elastic scaling Easy load balancing Column storage for OLAP workloads Cons Writes not immediately persisted to disk Reads cross multiple disk, memory locations No geo-replication Latency/bottleneck of HBaseMaster when using REST 72

68 CASSANDRA 73 73

69 Architecture Facebook s storage system BigTable data model Dynamo partitioning and consistency model Peer-to-peer architecture Client Client Client Client Client Cassandra node Cassandra node Cassandra node Cassandra node Disk Disk Disk Disk 74

70 Cassandra Server Writes go to log and memory table Periodically memory table merged with disk table Update Cassandra node RAM Memtable (later) Log Disk SSTable file 75

71 Pros and Cons Pros Elastic scalability Easy management Peer-to-peer configuration BigTable model is nice Flexible schema, column groups for partitioning, versioning, etc. Eventual consistency is scalable Cons Eventual consistency is hard to program against No built-in support for geo-replication Gossip can work, but is not really optimized for, cross-datacenter Load balancing? Consistent hashing limits options System complexity P2P systems are complex; have complex corner cases 76

72 NUMBERS Keep in mind that these systems are in active development, and these numbers represent just a current snapshot! 77 77

73 Benchmark tiers Tier 1 Performance For constant hardware, increase offered throughput until saturation Measure resulting latency/throughput curve Sizeup in Wisconsin benchmark terminology Tier 2 Scalability Scaleup Increase hardware, data size and workload proportionally. Measure latency; should be constant Elastic speedup Run workload against N servers; while workload is running att N+1th server; measure timeseries of latencies (should drop after adding server) 78

74 Test setup Setup Six server-class machines 8 cores (2 x quadcore) 2.5 GHz CPUs, 8 GB RAM, 6 x 146GB 15K RPM SAS drives in RAID 1+0, Gigabit ethernet, RHEL 4 Plus extra machines for clients, routers, controllers, etc. Cassandra (0.6.0-beta2 for range queries) HBase MySQL organized into a sharded configuration Sherpa 1.8 with MySQL No replication; force updates to disk (except HBase, which primarily commits to memory) Workloads 120 million 1 KB records = 20 GB per server Reads retrieve whole record; updates write a single field 100 or more client threads Caveats Write performance would be improved for Sherpa, sharded MySQL and Cassandra with a dedicated log disk We tuned each system as well as we knew how, with assistance from the teams of developers 79

75 Workload A Update heavy 50/50 Read/update Workload A - Read latency Workload A - Update latency Average read latency (ms) Update latency (ms) Throughput (ops/sec) Throughput (ops/sec) Cassandra Hbase Sherpa MySQL Cassandra Hbase Sherpa MySQL Comment: Cassandra is optimized for writes, and achieves higher throughput and lower latency. Sherpa and MySQL achieve roughly comparable performance, as both are limited by MySQL s capabilities. HBase has good write latency, because of commits to memory, and somewhat higher read latency, because of the need to reconstruct records. 80

76 95/5 Read/update Workload B Read heavy Workload B - Read latency Workload B - Update latency Average read latency (ms) Throughput (operations/sec) Average update latency (ms) Throughput (operations/sec) Cassandra HBase Sherpa MySQL Cassandra Hbase Sherpa MySQL Comment: Sherpa does very well here, with better read latency only one lookup into a B- tree is needed for reads, unlike log-structured systems where records must be reconstructed. Cassandra also performs well, matching Sherpa until high throughputs. HBase does well also, although read time is higher. 81

77 Workload E short scans Scans of records of size 1KB 120 Workload E - Scan latency Average scan latency (ms) Throughput (operations/sec) Hbase Sherpa Cassandra Comment: HBase and Sherpa are roughly equivalent for latency and peak throughput, even though HBase is meant for scans. Cassandra s performance is poor, but the development team notes that many optimizations still need to be done. 82

78 Workload E range size Vary size of range scans Range size versus latency (Workload E) Average range scan latency (ms) Max range size (records) Hbase Sherpa Comment: For small ranges, queries are similar to random lookups; Sherpa is efficient for random lookups and does well. As range increases, HBase begins to perform better since it is optimized for large scans 83

79 Scale-up Read heavy workload with varying hardware Read latency during scale-up Average read latency (ms) Number of servers Cassandra Hbase Sherpa Comment: Sherpa and Casandra scale well, with flat latency as system size increases. HBase is very unstable; 3 servers or less performs very poorly. 84

80 Elasticity Run a read-heavy workload on 2 servers; add a 4 th, then 5 th, then 6 th server. 140 Sherpa Elasticity - 5th to 6th server 120 Read latency (ms) Duration of test (min) Comment: Sherpa shows variance in response time as tablets are moving, but after the data moves, it settles into an average that is faster than before the sixth server was added. 85

81 Elasticity Run a read-heavy workload on 2 servers; add a 4 th, then 5 th, then 6 th server. Cassandra Elasticity 5 th to 6 th server Read latency (ms) Duration of test (min) Comment: Cassandra shows lots of latency variance as it moves data to the new server, and takes multiple hours to stabilize 86

82 Elasticity Run a read-heavy workload on 2 servers; add a 4 th, then 5 th, then 6 th server. 250 HBase Elasticity - 5th to 6th Server 200 Read latency (ms) Test duration (min) Comment: HBase shows a small latency bump as the cluster reconfigures. But data is not moved to the new server until a compaction is performed (not shown in the graph) 87

83 Cassandra vs 0.5 Workload B - Read heavy Average latency (ms) Throughput (operations/sec) Cas 0.5 Read Cas 0.5 Update Cas Read Cas Update 88

84 For more information Contact: Brian Cooper Detailed writeup of benchmark: Open source YCSB tool coming soon (watch nt/ycsb) 89

85 Ongoing and Future Work ACID xacts over entity groups, coupled with entity group timelines across replicas Entity groups limited to a single machine, e.g., all records belonging to a user Auto tuning Multitenancy with SLAs Materialized views Alternative storage unit architectures LSM files (Stasis) Hybrid OLAP/OLTP workloads? Leveraging large (aggregate) main memories 95

86 New in 2010! SIGMOD and SIGOPS are starting a new annual conference, to be co-located alternately with SIGMOD and SOSP: ACM Symposium on Cloud Computing (SoCC) PC Chairs: Surajit Chaudhuri & Mendel Rosenblum Steering committee: Phil Bernstein, Ken Birman, Joe Hellerstein, John Ousterhout, Raghu Ramakrishnan, Doug Terry, John Wilkes 96

87 QUESTIONS? 97 97