Lecture 5: Distributed Databases and BigTable
922EU3870 Cloud Computing and Mobile Platforms, Autumn 2009 (2009/10/12)
http://labs.google.com/papers/bigtable.html
Ping Yeh (葉平), Google, Inc.
Numbers real world engineers should know
  L1 cache reference                                      0.5 ns
  Branch mispredict                                         5 ns
  L2 cache reference                                        7 ns
  Mutex lock/unlock                                       100 ns
  Main memory reference                                   100 ns
  Compress 1 KB with Zippy                             10,000 ns
  Send 2 KB through 1 Gbps network                     20,000 ns
  Read 1 MB sequentially from memory                  250,000 ns
  Round trip within the same data center              500,000 ns
  Disk seek                                        10,000,000 ns
  Read 1 MB sequentially from network              10,000,000 ns
  Read 1 MB sequentially from disk                 30,000,000 ns
  Round trip between California and Netherlands   150,000,000 ns
The Joys of Real Hardware
Typical first year for a new cluster:
  ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
  ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  ~1 network rewiring (rolling ~5% of machines down over 2-day span)
  ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  ~5 racks go wonky (40-80 machines see 50% packet loss)
  ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
  ~3 router failures (have to immediately pull traffic for an hour)
  ~dozens of minor 30-second blips for DNS
  ~1000 individual machine failures
  ~thousands of hard drive failures
  plus slow disks, bad memory, misconfigured machines, flaky machines, etc.
Overview of Distributed Databases
Silberschatz, Korth, Sudarshan, Database System Concepts, 4th ed., McGraw Hill
M. Tamer Özsu, Distributed Database Systems, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.2276&rep=rep1&type=pdf
Glossary
Transaction: a unit of consistent and atomic execution against the database.
Termination protocol: a protocol by which individual sites can decide how to terminate a particular transaction when they cannot communicate with other sites where the transaction executes.
Other terms: concurrency control algorithm, distributed DBMS, locking, logging protocol, one-copy equivalence, query processing, query optimization, quorum-based voting algorithm, read-once/write-all protocol, serializability, transparency, two-phase commit, two-phase locking.
ACID Properties of Transactions
Atomicity: either all the operations of a transaction are executed or none of them are (all-or-nothing).
Consistency: the database is in a legal state before and after a transaction.
Isolation: the effects of one transaction on the database are isolated from other transactions until the first completes its execution.
Durability: the effects of successfully completed (i.e., committed) transactions endure subsequent failures.
Introduction
Distributed database system = distributed database + distributed DBMS.
Distributed database: a collection of multiple inter-correlated databases distributed over a computer network.
Distributed DBMS: a database management system (DBMS) that can manage a distributed database and make the distribution transparent to users. It consists of:
  query nodes: user interface routines
  data nodes: data storage
Loosely coupled: nodes are connected by a network, and each node has its own storage, processor, and operating system.
Database System Architectures
Centralized: one host for everything; multi-processor possible, but a transaction gets only one processor.
Parallel: a transaction may be processed by multiple processors.
Client-server: database stored on one server host for multiple clients, centrally managed.
Distributed: database stored on multiple hosts, transparent to clients.
Peer-to-peer: each node is both a client and a server; requires sophisticated protocols, still in development.
Data Models
Hierarchical model: data organized in a tree namespace.
Network model: like the hierarchical model, but a record may have multiple parents.
Entity-relationship model: data are organized into entities which can have relationships among them.
Object-oriented model: database capability in an object-oriented language.
Semi-structured model: the schema is contained in the data (often associated with "self-describing" and XML).
etc.
Data Distribution
Data is physically distributed among data nodes.
Fragmentation: divide data onto data nodes.
  Enables placing data close to clients; may reduce the size of data involved and the transmission cost.
Replication: copy data among data nodes.
  Preferable when the same data are accessed from applications that run at multiple nodes; may be more cost-effective to duplicate data at multiple nodes than to continuously move it between them.
Many different schemes of fragmentation and replication exist.
Fragmentation
Horizontal fragmentation: split by rows based on a fragmentation predicate.
Vertical fragmentation: split by columns based on attributes.
Example relation:
  Last name  First name  Department        ID
  Chang      Three       Computer Science  X12045
  Lee        Four        Law               Y34098
  Chang      Frank       Medicine          Z99441
  Wang       Andy        Medicine          S94717
Also called partitioning in some literature.
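A minimal sketch of the two fragmentation schemes in plain Python, using the example relation above (the predicate and attribute split are illustrative choices, not prescribed by any system):

```python
# Horizontal vs. vertical fragmentation of the example relation.
rows = [
    {"last": "Chang", "first": "Three", "dept": "Computer Science", "id": "X12045"},
    {"last": "Lee",   "first": "Four",  "dept": "Law",              "id": "Y34098"},
    {"last": "Chang", "first": "Frank", "dept": "Medicine",         "id": "Z99441"},
    {"last": "Wang",  "first": "Andy",  "dept": "Medicine",         "id": "S94717"},
]

def horizontal_fragment(rows, predicate):
    """Split by rows: each fragment holds the full tuples that satisfy (or fail) the predicate."""
    return ([r for r in rows if predicate(r)],
            [r for r in rows if not predicate(r)])

def vertical_fragment(rows, attrs):
    """Split by columns: each fragment keeps the key ('id') plus a subset of attributes."""
    def keep(r, names):
        return {k: r[k] for k in names}
    rest = tuple(k for k in rows[0] if k not in attrs and k != "id")
    return ([keep(r, ("id",) + attrs) for r in rows],
            [keep(r, ("id",) + rest) for r in rows])

# e.g. place the Medicine rows on a node close to that department's clients
medicine, others = horizontal_fragment(rows, lambda r: r["dept"] == "Medicine")
names, depts = vertical_fragment(rows, ("last", "first"))
```

Both fragments of a vertical split carry the key so the original tuples can be reconstructed by a join.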
Other Properties of Distributed Databases
Concurrency control: make sure the distributed database is in a consistent state after a transaction.
Reliability protocols: ensure termination of transactions in the face of failures (system failure, storage failure, lost messages, network partition, etc.).
One-copy equivalence: all replicas of the same data item must have the same value.
Query Optimization
Looking for the best execution strategy for a given query. Typically done in 4 steps:
  query decomposition: translate the query to relational algebra (for a relational database) and analyze/simplify it
  data localization: decide which fragments are involved and generate local queries to fragments
  global optimization: find the best execution strategy of queries and messages to fragments
  local optimization: optimize the query at a node for a fragment
A sophisticated topic.
B+ Tree
A data structure often used for indices of file systems or databases.
Data are indexed by keys.
Leaf nodes store data; interior nodes store links.
Usually one disk block per node, to reduce disk seeks.
Except for the root node, the number of links or data entries in each node is bounded in [d/2, d], where d = the order of the B+ tree, typically large.
[Figure: example B+ tree with keys in the interior nodes and data entries d_k in the linked leaves]
Insertion into a B+ Tree

InsertToTree(key, value, bplus_tree):
    node = Find(key, bplus_tree)    # find the leaf to insert into
    Insert(key, value, node)

Insert(key, value, node):
    AddData(key, value, node)
    if Size(node) > d:
        new_node = Split(node)      # node -> new_node + node
        Insert(new_node.lastkey(), new_node, parent)   # parent = node's parent

May produce a new root node (not shown).
Deletion in a B+ Tree

DeleteInTree(key, bplus_tree):
    node = Find(key, bplus_tree)    # find the leaf holding the key
    if not node:
        return False
    Delete(key, node)
    return True

Delete(key, node):
    RemoveData(key, node)
    if Size(node) < d/2:
        RedistributeOrMerge(node, parent)   # parent = node's parent

[Figure: the example tree after deletion]
Features of a B+ Tree
Good fit for sorted data stored in block storage devices.
Fast search: O(log_d N) with large d.
Fast range scan with links from one leaf node to the next: O(log_d N + k), where k = number of elements returned.
Insertion may cause splitting of nodes; deletion may cause merging of nodes.
Many optimizations exist (with pros and cons):
  data structure of a node (array, binary tree, linked list, etc.)
  compression of keys in a node
  lazy deletion
  RAM-resident trees
  etc.
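The two costs above can be sketched at the leaf level. This is not a full B+ tree (no interior nodes or splitting); it only models the sorted leaf keys, with binary search standing in for the root-to-leaf descent and a linear walk standing in for following leaf links. The keys and payloads are made up:

```python
import bisect

# Sorted leaf keys and their payloads (illustrative data).
keys = [1, 4, 5, 30, 33, 34, 822, 823, 825]
data = {k: "d%d" % k for k in keys}

def lookup(key):
    """O(log N) search, standing in for the root-to-leaf descent."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return data[key]
    return None

def range_scan(lo, hi):
    """O(log N + k): locate the first matching entry, then walk right
    (in a real B+ tree, by following the links between leaf nodes)."""
    i = bisect.bisect_left(keys, lo)
    out = []
    while i < len(keys) and keys[i] <= hi:
        out.append(data[keys[i]])
        i += 1
    return out
```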
Compressing Data in a B+ Tree
How to use less space in nodes?
Compress all keys together:
  most space efficient
  but reading 10 bytes requires uncompressing the whole node
Split the keys into blocks and compress each block:
  less space efficient
  but faster for small reads
BigTable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, OSDI 2006 http://labs.google.com/papers/bigtable.html
Motivation
Lots of (semi-)structured data at Google:
  Web: contents, crawl metadata, links/anchors/pagerank, ...
  Per-user data: user preference settings, recent queries, search results, ...
  Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
Scale is large:
  billions of URLs, many versions/page (~20K/version)
  hundreds of millions of users, thousands of queries/sec
  100 TB+ of satellite image data
Need data both for offline data processing and online serving.
Why Not Use a Commercial DB?
Scale is too large for most commercial databases.
Even if it weren't, the cost would be very high.
Building internally means the system can be applied across many projects for low incremental cost.
Low-level storage optimizations help performance significantly — much harder to do when running on top of a database layer.
Also fun and challenging to build large-scale systems :)
Goals
Wide applicability, for use by many Google products and projects:
  often want to examine data changes over time, e.g., contents of a web page over multiple crawls
  both throughput-oriented batch-processing jobs and latency-sensitive serving of data to end users
Scalability: a handful to thousands of servers, hundreds of TB to PB.
High performance: very high read/write rates (millions of ops per second); efficient scans over all or interesting subsets of data.
High availability: want access to the most current data at any time.
BigTable
Distributed multi-level map with an interesting data model.
Fault-tolerant, persistent.
Scalable:
  thousands of servers
  terabytes of in-memory data, petabytes of disk-based data
  millions of reads/writes per second, efficient scans
Self-managing:
  servers can be added/removed dynamically
  servers adjust to load imbalance
Status
Design and initial implementation started at the beginning of 2004.
In production use or active development for many projects: Google Analytics, Personal Search History, the crawling/indexing pipeline, Google Maps/Google Earth, Blogger, ...
~100 BigTable cells; the largest cell manages ~200 TB of data spread over several thousand machines (circa 2007).
Building Blocks of BigTable
Distributed file system (GFS): stores persistent state.
Scheduler (not published): schedules jobs onto machines; BigTable jobs run among all kinds of jobs.
Lock service (Chubby): distributed lock manager; can also reliably hold small files with high availability; used for master election and location bootstrapping.
Data processing (MapReduce): simplified large-scale data processing; often used to read/write BigTable data (not a building block of BigTable, but uses BigTable heavily).
Google File System (GFS)
Master manages metadata.
Data transfers happen directly between clients and chunkservers.
Files are broken into chunks (typically 64 MB).
Chunks are replicated across three machines for safety.
See the SOSP'03 paper at http://labs.google.com/papers/gfs.html
[Figure: one master, several chunkservers, several clients]
Chubby
Distributed lock service with a file system for small files.
Usually 5 servers running the Paxos algorithm:
  maintains consistency, fault-tolerant
  master election
  event notification mechanism
Also used for name resolution in the cluster.
Key Jobs in a BigTable Cluster
Master:
  schedules tablet assignments
  quota management
  health checks of tablet servers
  garbage collection management
Tablet servers:
  serve data for reads and writes (one tablet is assigned to exactly one tablet server)
  compaction, replication, monitoring, etc.
Typical Cluster
[Figure: a shared cluster running the cluster scheduling master, the lock service, and the GFS master, plus machines 1..N, each running user apps, a BigTable server (one machine also runs the BigTable master), a scheduler slave, and a GFS chunkserver on Linux]
BigTable Overview
Data model
Implementation structure: tablets, compactions, locality groups, API
Details: shared logs, compression, replication, ...
Current/future work
Basic Data Model
Semi-structured: a multi-dimensional sparse map
  (row, column, timestamp) -> cell contents
Example: row "com.cnn.www", column families "contents:" and "inlinks:", with "<html>" versions at timestamps t3, t11, t17.
Good match for most of Google's applications.
Rows
Every row has a single key: an arbitrary string (how about numerical keys?).
Access to data in a row is atomic.
Row creation is implicit upon storing data.
Rows are ordered lexicographically by key: rows close together lexicographically are usually on one or a small number of machines.
Question: key distribution? Hot rows?
No such thing as an empty row (see the Columns slide).
Columns
Arbitrary number of columns, organized into column families, then locality groups:
  data in the same locality group are stored together (more later)
Don't predefine columns (compare: schema):
  multi-map, not table; column names are arbitrary strings
  sparse: a row contains only the columns that have data
Column Families
Must be created before any column in the family can be written.
Has a type: string, protocol buffer, ...
Basic unit of access control and usage accounting: different applications need access to different column families; be careful with sensitive data.
A column key is named family:qualifier (family: printable; qualifier: any string).
Usually not many column families in a BigTable cluster (hundreds), but unlimited columns per family:
  e.g. one anchor: column family for all anchors of incoming links, with columns anchor:cnn.com, anchor:news.yahoo.com, anchor:someone.blogger.com, ...
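The (row, column, timestamp) → value model above can be sketched as a nested Python map. This is purely illustrative — the rows, column names, and timestamps are made-up examples, not BigTable's storage format:

```python
# Sketch of BigTable's data model: (row, "family:qualifier", timestamp) -> value.
table = {}

def write(row, column, timestamp, value):
    """Row creation is implicit upon storing data; the map stays sparse."""
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def read_latest(row, column):
    """Return the most recent version of one cell."""
    versions = table[row][column]
    return versions[max(versions)]

write("com.cnn.www", "contents:", 3, "<html>...v3")
write("com.cnn.www", "contents:", 11, "<html>...v11")
write("com.cnn.www", "anchor:cnn.com", 9, "CNN")
```

A row holds only the columns that have data, which is why there is no such thing as an empty row.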
BigTable Operations
Reading: selection by a combination of row, column, and timestamp ranges.
Writing: write to individual cell versions (row, column, timestamp).
Delete: different granularities, up to a whole row.
All applied atomically within a row.
Read API
Scanner: read arbitrary cells in a bigtable.
Each row read is atomic.
Can restrict returned rows to a particular range.
Can ask for just data from 1 row (Lookup), all rows, etc.
Can ask for all columns, just certain column families, specific columns, timestamp ranges (ScanStream).

Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
Write API
Metadata operations: create/delete tables and column families, change metadata.
Row mutation:
  Apply: single row only, atomic, a sequence of sets and deletes.
  APIs exist for bulk updates: updates are grouped and sent with one RPC call.

Table *T = OpenOrDie("/bigtable/web/webtable");
// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);
Tablets
Large tables are broken into tablets at row boundaries:
  a tablet holds a contiguous range of rows
  clients can often choose row keys to achieve locality
  aim for ~100 MB to 200 MB of data per tablet
Each serving machine is responsible for ~100 tablets:
  fast recovery: 100 machines each pick up 1 tablet from a failed machine
  fine-grained load balancing: migrate tablets away from overloaded machines; the master makes load-balancing decisions
Tablets
Dynamic fragmentation of rows; the unit of load balancing, distributed over tablet servers.
Tablets split and merge automatically, based on size and load, or manually.
Clients can choose row keys to achieve locality.
Tablets & Splitting
[Figure: a table with "language:" and "contents:" columns and rows aaa.com, cnn.com (EN, <html>), cnn.com/sports.html, ..., website.com, ..., yahoo.com/kids.html, yahoo.com/kids.html\0, ..., zuppa.com/menu.html, broken into tablets at row boundaries]
Locality Groups
Dynamic fragmentation of column families: segregates data within a tablet.
Different locality groups → different SSTable files on GFS.
Scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table).
Provides control over storage layout:
  memory mapping of locality groups
  choice of compression algorithms
  client-controlled block size
Locality Groups
[Figure: row www.cnn.com with the "contents:" family ("<html>") in one locality group, and "language:" (EN) and "pagerank:" (0.65) in another]
Timestamps
Used to store different versions of data in a cell.
New writes default to the current time, but timestamps for writes can also be set explicitly by clients.
Lookup options:
  return the most recent K values
  return all values in a timestamp range (or all values)
Column families can be marked with attributes:
  only retain the most recent K values in a cell
  keep values until they are older than K seconds
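The "only retain most recent K values" attribute amounts to a simple garbage-collection rule per cell. A sketch, assuming a cell is represented as a {timestamp: value} map (the timestamps and values are made up):

```python
# Sketch of the per-column-family GC policy "only retain most recent K values".
def retain_most_recent(cell, k):
    """Keep only the k newest timestamped values of one cell."""
    newest = sorted(cell, reverse=True)[:k]
    return {t: cell[t] for t in newest}

cell = {3: "v3", 11: "v11", 17: "v17", 42: "v42"}
trimmed = retain_most_recent(cell, 2)
```

The "older than K seconds" attribute would filter on `timestamp >= now - K` instead of a count.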
Where Are My Tablets?
Tablets move around from one tablet server to another (why?).
Question: given a row, how does a client find the right tablet server?
  Tablet server location is ip:port.
  Need to find the tablet whose row range covers the target row.
One approach: use the BigTable master — but a central server would almost certainly be a bottleneck in a large system.
Instead: store tablet location info in special tablets, similar to a B+ tree.
Metadata Tablets
Approach: a 3-level B+-tree-like scheme for tablets:
  1st level: Chubby, points to MD0 (the root tablet)
  2nd level: MD0 data points to the appropriate METADATA tablet
  3rd level: METADATA tablets point to data tablets
METADATA tablets can be split when necessary; MD0 never splits, so the number of levels is fixed.
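The three-level descent can be sketched as two range lookups below a Chubby pointer. This is a simplification — the tablet names, row ranges, and server addresses are invented, and real METADATA rows encode (table id, end row) rather than bare row keys — but the covering-range logic is the same at both levels:

```python
import bisect

# Level 1: a Chubby file names the root tablet (MD0).
chubby = {"/bigtable/root": "MD0"}

# Levels 2 and 3: each tablet is a sorted list of (end_row, child) pairs;
# the entry whose end row is the first one >= the target row covers it.
location = {
    "MD0":   [("user-g", "META1"), ("user-z", "META2")],   # MD0 -> METADATA tablets
    "META1": [("user-c", "ts-3:8004"), ("user-g", "ts-7:8004")],
    "META2": [("user-z", "ts-1:8004")],                    # METADATA -> data tablets
}

def covering(entries, row):
    """Find the child whose range covers `row` (assumes row <= the last end key)."""
    ends = [end for end, _ in entries]
    return entries[bisect.bisect_left(ends, row)][1]

def find_tablet_server(row):
    root = chubby["/bigtable/root"]
    meta = covering(location[root], row)    # which METADATA tablet?
    return covering(location[meta], row)    # which tablet server?
```

Because MD0 never splits, this lookup is always exactly three hops when nothing is cached.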
Finding Tablet Location
Clients cache tablet locations.
If a client does not have a location, it makes up to three network round-trips when its cache is empty, and up to six round-trips when its cache is stale.
Tablet locations are stored in memory, so no GFS accesses are required.
Tablet Storage
Commit log on GFS; the redo log is buffered in the tablet server's memory.
A set of locality groups: one locality group = a set of SSTable files on GFS.
key = <row, column, timestamp>, value = cell contents.
SSTable
SSTable: string-to-string table.
A persistent, ordered, immutable map from keys to values; keys and values are arbitrary byte strings.
Contains a sequence of blocks (typical size = 64 KB), with a block index at the end of the SSTable, loaded at open time.
One disk seek per block read.
Operations: lookup(key), iterate(key_range).
An SSTable can be mapped into memory.
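A minimal in-memory sketch of the SSTable idea — sorted immutable blocks plus a block index of each block's last key, so a lookup searches the index in memory and then touches only one block (one disk seek, in the real thing). Block size and data are illustrative:

```python
import bisect

class SSTable:
    """Sketch: immutable sorted key/value blocks with an end-key block index."""

    def __init__(self, items, block_size=2):
        items = sorted(items)                       # sorted once, then frozen
        self.blocks = [items[i:i + block_size]
                       for i in range(0, len(items), block_size)]
        self.index = [b[-1][0] for b in self.blocks]  # last key of each block

    def lookup(self, key):
        i = bisect.bisect_left(self.index, key)     # index search, in memory
        if i == len(self.blocks):
            return None
        for k, v in self.blocks[i]:                 # one block "read" (disk seek)
            if k == key:
                return v
        return None

    def iterate(self, lo, hi):
        """iterate(key_range): yield (key, value) pairs with lo <= key <= hi."""
        for block in self.blocks:
            for k, v in block:
                if lo <= k <= hi:
                    yield k, v

sst = SSTable([("b", "1"), ("a", "0"), ("d", "3"), ("c", "2")])
```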
Tablet Serving
Writes go to an append-only commit log on GFS and to an in-memory memtable (random-access).
Reads see a merged view of the memtable and the tablet's SSTables on GFS.
Minor compaction writes the memtable out as a new SSTable on GFS.
SSTable: immutable on-disk ordered map from string to string; string keys are <row, column, timestamp> triples.
Compactions
Tablet state is represented as a set of immutable compacted SSTable files, plus the tail of the log (buffered in memory).
Minor compaction: when in-memory state fills up, pick the tablet with the most data and write its contents to SSTables stored in GFS.
Major compaction: periodically compact all SSTables for a tablet into one new base SSTable on GFS; storage is reclaimed from deletions at this point (garbage collection).
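A sketch of what a major compaction accomplishes: merge the memtable and existing SSTables, and drop deletion markers together with the data they shadow. For simplicity this collapses each cell to its single newest version (real BigTable keeps versions subject to the GC attributes); the map layout and example mutations are made up:

```python
# Deletion marker ("tombstone") written in place of a value.
DELETED = object()

def major_compaction(memtable, sstables):
    """memtable and each sstable map (row, column, timestamp) -> value.
    Returns one new base table with tombstones garbage-collected."""
    merged = {}
    for source in sstables + [memtable]:
        merged.update(source)                       # union of all cell versions
    latest = {}
    for (row, col, ts), value in merged.items():    # newest timestamp wins per cell
        key = (row, col)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return {(r, c, ts): v for (r, c), (ts, v) in latest.items()
            if v is not DELETED}                    # reclaim deleted cells

old = {("r1", "a:", 1): "x", ("r1", "b:", 1): "y"}
mem = {("r1", "a:", 5): DELETED, ("r1", "b:", 7): "y2"}
base = major_compaction(mem, [old])
```

A minor compaction would only convert the memtable into a new SSTable; tombstones must survive it, because older SSTables may still hold the shadowed values.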
System Structure
A BigTable cell consists of:
  BigTable clients + client library: Open(), metadata ops to the master, reads/writes directly to tablet servers
  BigTable master: performs metadata ops and load balancing
  BigTable tablet servers: serve data
Underneath:
  cluster scheduling system: handles failover, monitoring
  GFS: holds tablet data, logs
  lock service: holds metadata, handles master election
File Cleaning
BigTable generates a lot of files, dominated by SSTables.
SSTables are immutable: they can be created, read, or deleted, but not overwritten.
Obsolete SSTables are deleted in a mark-and-sweep garbage collection run by the BigTable master.
Chubby Interactions
Master election: a single Chubby lock.
Tablet server membership: a tablet server creates and acquires an exclusive lock on a uniquely-named file in the servers directory of Chubby when it starts, and stops serving when the lock is lost; the master monitors the directory to find tablet servers.
Chubby also stores:
  the access control list
  metadata: schema information (column family metadata)
  tablet advertisement and metadata
  replication metadata
Shared Logs
Designed for 1M tablets on 1000s of tablet servers; 1M logs being simultaneously written performs badly.
Solution: shared logs.
  Write one log file per tablet server instead of per tablet.
  Updates for many tablets are co-mingled in the same file.
  Start new log chunks every so often (64 MB).
Problem: during recovery, a server needs to read log data to apply mutations for a tablet — lots of wasted I/O if many machines need to read data for many tablets from the same log chunk.
Shared Log Recovery
Servers inform the master of the log chunks they need to read.
The master aggregates and orchestrates sorting of the needed chunks, assigning log chunks to different tablet servers to sort.
Those servers sort the chunks by tablet and write the sorted data to local disk.
Recovering tablet servers ask the master which servers have the sorted chunks they need, then issue direct RPCs to those peer tablet servers to read the sorted data for their tablets.
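The sorting step above can be sketched in a few lines. Log entries here are invented (tablet name, sequence number, mutation string) tuples; the point is just that sorting a co-mingled chunk by (tablet, sequence) makes each tablet's mutations contiguous and in order:

```python
# A shared-log chunk co-mingles ordered mutations for many tablets.
log_chunk = [
    ("tabletB", 2, "set r9 col=v"),
    ("tabletA", 1, "set r1 col=v"),
    ("tabletB", 1, "set r8 col=v"),
    ("tabletA", 2, "delete r1 col"),
]

def sort_chunk(entries):
    """Sort by (tablet, sequence) so each tablet's mutations are contiguous."""
    return sorted(entries, key=lambda e: (e[0], e[1]))

def mutations_for(tablet, sorted_entries):
    """The contiguous run a recovering server replays for one tablet."""
    return [m for t, _, m in sorted_entries if t == tablet]
```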
BigTable Compression
Keys: sorted strings of (row, column, timestamp) → prefix compression.
Values:
  group together values by type (e.g. column family name)
  BMDiff across all values in one family: the BMDiff output for values 1..N is the dictionary for value N+1
  Zippy as a final pass over the whole block: catches more localized repetitions, also catches cross-column-family repetition, and compresses the keys
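Prefix compression of the sorted keys is simple to sketch: store each key as the length of the prefix it shares with the previous key, plus the remaining suffix. The keys below are illustrative:

```python
# Prefix compression over sorted keys: (shared-prefix length, suffix) pairs.
def prefix_compress(sorted_keys):
    out, prev = [], ""
    for key in sorted_keys:
        n = 0
        while n < min(len(prev), len(key)) and prev[n] == key[n]:
            n += 1                                  # length of shared prefix
        out.append((n, key[n:]))
        prev = key
    return out

def prefix_decompress(encoded):
    keys, prev = [], ""
    for n, suffix in encoded:
        key = prev[:n] + suffix                     # rebuild from previous key
        keys.append(key)
        prev = key
    return keys

keys = ["com.cnn.www/a", "com.cnn.www/b", "com.cnn.www/sports"]
enc = prefix_compress(keys)
```

This works well precisely because BigTable keeps keys sorted: adjacent keys share long prefixes.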
Compression
Many opportunities for compression:
  similar values in the same row/column at different timestamps
  similar values in different columns
  similar values across adjacent rows
Within each SSTable for a locality group, encode compressed blocks:
  keep blocks small for random access (~64 KB compressed data)
  exploit the fact that many values are very similar
  needs to be low CPU cost for encoding/decoding
Two building blocks: BMDiff and Zippy.
BMDiff
Bentley & McIlroy, DCC'99: "Data Compression Using Long Common Strings".
Input: dictionary + source. Output: a sequence of
  COPY: <x> bytes from offset <y>
  LITERAL: <literal text>
Store a hash at every 32-byte aligned boundary in the dictionary and in the source processed so far.
For every new source byte: compute an incremental hash of the last 32 bytes and look it up in the hash table; on a hit, expand the match forwards and backwards and emit a COPY.
Encode: ~100 MB/s; decode: ~1000 MB/s.
Zippy
LZW-like: store a hash of the last four bytes in a 16K-entry table.
For every input byte: compute the hash of the last four bytes, look it up in the table, and emit a COPY or LITERAL.
Differences from BMDiff:
  much smaller compression window (local repetitions)
  hash table is not associative
  careful encoding of COPY/LITERAL tags and lengths
Sloppy but fast:
  Algorithm  % remaining  Encoding   Decoding
  Gzip       13.4%        21 MB/s    118 MB/s
  LZO        20.5%        135 MB/s   410 MB/s
  Zippy      22.2%        172 MB/s   409 MB/s
Compression Effectiveness
Experiment: store the contents of a 2.1B-page crawl in a BigTable instance.
Key: URL rearranged as com.cnn.www/index.html:http
  groups pages from the same site together — good for compression, and good for clients (efficient to scan over all pages on a web site)
One compression strategy — gzip each page: ~28% bytes remaining.
BigTable: BMDiff + Zippy:
  Type               Count (B)  Space    Compressed  % remaining
  Web page contents  2.1        45.1 TB  4.2 TB      9.2%
  Links              1.8        11.2 TB  1.6 TB      13.9%
  Anchors            126.3      22.8 TB  2.9 TB      12.7%
Bloom Filters
A read may need to consult many SSTables.
Idea: use a membership test to remove disk reads for non-existing data.
  Membership test: does (row, column) exist in the tablet?
  Algorithm: Bloom filter.
No false negatives; false positives cost a read to find out.
Update the bit vector when new data is inserted. (Delete?)
[Figure: data items a_1..a_N hashed by k independent hash functions h_1..h_k into an m-bit vector; a query key b is hashed to the same k positions]
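A minimal Bloom filter sketch: k hash positions per key in an m-bit vector, set on insert and checked on query. The parameters and the use of salted SHA-256 as the k "independent" hash functions are illustrative choices, not what BigTable uses:

```python
import hashlib

class BloomFilter:
    """Sketch of the membership test: no false negatives, occasional false positives."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, key):
        # k salted hashes of the key, each reduced to a bit position.
        for i in range(self.k):
            h = hashlib.sha256(("%d:%s" % (i, key)).encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # True may be a false positive (forcing a real SSTable read);
        # False is definitive, so the disk read can be skipped.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row1:contents:")
```

Plain bit-clearing on delete is unsafe (other keys may share the bit), which is the catch behind the "Delete?" question above.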
Replication
Often want updates replicated to many BigTable cells in different datacenters:
  low-latency access from anywhere in the world
  disaster tolerance
Optimistic replication scheme: writes to any of the on-line replicas are eventually propagated to the other replica clusters; 99.9% of writes are replicated immediately (speed of light).
Currently a thin layer above the BigTable client library; working to move support inside the BigTable system.
Replication deployed on My Search History.
Performance
[Figure: performance results]
Application: Personalized Search
Personalized Search (http://www.google.com/psearch): an opt-in service.
  Records a user's queries and clicks across Google (web search, image search, news, etc.).
  Users can edit their search history; the search history affects search results.
Implementation in BigTable:
  one user per row; row name = user ID
  one column family per action
  analyzed with MapReduce to produce user profiles
  other products added column families later → quota system
Sample Usages
[Figure: sample usages]
In Development / Future Plans
More expressive data manipulation/access: allow sending small scripts to perform read/modify/write transactions so that they execute on the server (a kind of stored procedure).
Multi-row (i.e., distributed) transaction support.
General performance work for very large cells.
BigTable as a service: interesting issues of resource fairness, performance isolation, prioritization, etc. across different clients (e.g., App Engine's Datastore).
Conclusions
Data model applicable to a broad range of clients; actively deployed in many of Google's services.
The system provides high-performance storage on a large scale:
  self-managing
  thousands of servers
  millions of ops/second
  multiple GB/s read and written