Lecture 5 Distributed Database and BigTable


Lecture 5: Distributed Database and BigTable
922EU3870 Cloud Computing and Mobile Platforms, Autumn 2009 (2009/10/12)
http://labs.google.com/papers/bigtable.html
Ping Yeh (葉平), Google, Inc.

Numbers real-world engineers should know:
  L1 cache reference: 0.5 ns
  Branch mispredict: 5 ns
  L2 cache reference: 7 ns
  Mutex lock/unlock: 100 ns
  Main memory reference: 100 ns
  Compress 1 KB with Zippy: 10,000 ns
  Send 2 KB through a 1 Gbps network: 20,000 ns
  Read 1 MB sequentially from memory: 250,000 ns
  Round trip within the same data center: 500,000 ns
  Disk seek: 10,000,000 ns
  Read 1 MB sequentially from network: 10,000,000 ns
  Read 1 MB sequentially from disk: 30,000,000 ns
  Round trip between California and the Netherlands: 150,000,000 ns

The Joys of Real Hardware. Typical first year for a new cluster:
  ~0.5 overheating events (power down most machines in <5 mins, ~1-2 days to recover)
  ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  ~1 rack move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  ~1 network rewiring (rolling ~5% of machines down over a 2-day span)
  ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  ~5 racks go wonky (40-80 machines see 50% packet loss)
  ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
  ~3 router failures (have to immediately pull traffic for an hour)
  ~dozens of minor 30-second blips for DNS
  ~1000 individual machine failures
  ~thousands of hard drive failures
  plus slow disks, bad memory, misconfigured machines, flaky machines, etc.

Overview of Distributed Database
  Silberschatz, Korth, Sudarshan, Database System Concepts, 4th ed., McGraw Hill
  M. Tamer Özsu, Distributed Database Systems, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.2276&rep=rep1&type=pdf

Glossary
  Transaction: a unit of consistent and atomic execution against the database.
  Termination protocol: a protocol by which individual sites can decide how to terminate a particular transaction when they cannot communicate with other sites where the transaction executes.
  Other terms to know: concurrency control algorithm, distributed DBMS, locking, logging protocol, one-copy equivalence, query processing, query optimization, quorum-based voting algorithm, read-once/write-all protocol, serializability, transparency, two-phase commit, two-phase locking.

ACID properties of transactions
  Atomicity: either all the operations of a transaction are executed or none of them are (all-or-nothing).
  Consistency: the database is in a legal state before and after a transaction.
  Isolation: the effects of one transaction on the database are isolated from other transactions until the first completes its execution.
  Durability: the effects of successfully completed (i.e., committed) transactions endure subsequent failures.

Introduction
  Database Management System (DBMS)
  Distributed database system = distributed database + distributed DBMS
  Distributed database: a collection of multiple interrelated databases distributed over a computer network.
  Distributed DBMS: a DBMS that can manage a distributed database and make the distribution transparent to users. Consists of:
    query nodes: user interface routines
    data nodes: data storage
  Loosely coupled: connected by a network; each node has its own storage, processor, and operating system.

Database System Architectures
  Centralized: one host for everything; multi-processor is possible, but a transaction gets only one processor.
  Parallel: a transaction may be processed by multiple processors.
  Client-Server: database stored on one server host for multiple clients, centrally managed.
  Distributed: database stored on multiple hosts, transparent to clients.
  Peer-to-Peer: each node is both a client and a server; requires sophisticated protocols, still in development.

Data Models
  Hierarchical Model: data organized in a tree namespace.
  Network Model: like the Hierarchical Model, but a data item may have multiple parents.
  Entity-Relationship Model: data are organized into entities which can have relationships among them.
  Object-Oriented Model: database capability in an object-oriented language.
  Semi-structured Model: the schema is contained in the data (often associated with "self-describing" and XML).
  etc.

Data Distribution
  Data is physically distributed among data nodes.
  Fragmentation: divide data onto data nodes. Enables placing data close to clients; may reduce the size of data involved; may reduce transmission cost.
  Replication: copy data among data nodes. Preferable when the same data are accessed from applications that run at multiple nodes; may be more cost-effective to duplicate data at multiple nodes rather than continuously moving it between them.
  Many different schemes of fragmentation and replication exist.

Fragmentation
  Horizontal fragmentation: split by rows based on a fragmentation predicate.
  Vertical fragmentation: split by columns based on attributes.
  Example table (split by rows under horizontal fragmentation, by columns under vertical fragmentation):
    Last name | First name | Department       | ID
    Chang     | Three      | Computer Science | X12045
    Lee       | Four       | Law              | Y34098
    Chang     | Frank      | Medicine         | Z99441
    Wang      | Andy       | Medicine         | S94717
  Also called partitioning in some literature.
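
A minimal sketch of the two splits over a toy version of that table; the dictionaries and helper names below are illustrative assumptions, not from the lecture:

    employees = [
        {"last": "Chang", "first": "Three", "dept": "Computer Science", "id": "X12045"},
        {"last": "Lee",   "first": "Four",  "dept": "Law",              "id": "Y34098"},
        {"last": "Chang", "first": "Frank", "dept": "Medicine",         "id": "Z99441"},
        {"last": "Wang",  "first": "Andy",  "dept": "Medicine",         "id": "S94717"},
    ]

    def horizontal_fragment(rows, predicate):
        """Split by rows: a fragment keeps the whole rows that satisfy its predicate."""
        return [r for r in rows if predicate(r)]

    def vertical_fragment(rows, attributes):
        """Split by columns: a fragment keeps only some attributes (plus the key 'id')."""
        return [{a: r[a] for a in ("id",) + tuple(attributes)} for r in rows]

    medicine_node = horizontal_fragment(employees, lambda r: r["dept"] == "Medicine")
    name_node = vertical_fragment(employees, ("last", "first"))
    print(medicine_node)   # the two Medicine rows, all columns
    print(name_node)       # every row, but only the id + name columns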

Other Properties of Distributed Databases
  Concurrency control: make sure the distributed database is in a consistent state after a transaction.
  Reliability protocols: ensure termination of transactions in the face of failures (system failure, storage failure, lost messages, network partition, etc.).
  One-copy equivalence: the same data item in all replicas must have the same value.

Query Optimization
  Looking for the best execution strategy for a given query. Typically done in 4 steps:
    query decomposition: translate the query to relational algebra (for a relational database) and analyze/simplify it
    data localization: decide which fragments are involved and generate local queries to the fragments
    global optimization: find the best execution strategy of queries and messages to the fragments
    local optimization: optimize the query at a node for a fragment
  A sophisticated topic.

B+ Tree
  A data structure often used for the indices of file systems or databases.
  Data are indexed by keys.
  Leaf nodes store data; nodes at interim levels store links.
  Usually one disk block is used per node to reduce disk seeks.
  Except for the root node, the number of links or data entries in each node is bounded within [d/2, d], where d = the order of the B+ tree, typically large.
  (Figure: an example B+ tree whose root holds keys 850 and 1780, interior nodes hold keys such as 29/57 and 850/879/910, and linked leaf nodes hold sorted keys such as 1, 4, 29, 30, 33, 57, ..., 822, 823, 825 together with their data records d_1, d_4, ...)

Insertion into a B+ tree (pseudocode):

    InsertToTree(key, value, bplus_tree):
        node = Find(key, bplus_tree)    # find the leaf node to insert into
        Insert(key, value, node)        # insert into that node

    Insert(key, value, node):
        AddData(key, value, node)
        if Size(node) > d:
            new_node = Split(node)                        # node -> new_node + node
            Insert(new_node.lastkey(), new_node, parent)  # push the split key up to the parent

May produce a new root node (not shown).

Deletion in a B+ tree (pseudocode):

    DeleteInTree(key, bplus_tree):
        node = Find(key, bplus_tree)    # find the leaf node holding the key
        if not node:
            return False
        Delete(key, node)               # delete from that node
        return True

    Delete(key, node):
        RemoveData(key, node)
        if Size(node) < d/2:
            RedistributeOrMerge(node, parent)   # rebalance with a sibling or merge nodes

Features of a B+ Tree
  Good fit for sorted data stored on block storage devices.
  Fast search: O(log_d N) with large d.
  Fast range scan with links from one leaf node to the next: O(log_d N + k), where k = number of elements returned.
  Insertion may cause splitting of nodes; deletion may cause merging of nodes.
  Many optimizations exist (with pros and cons): data structure of a node (array, binary tree, linked list, etc.), compression of keys in a node, lazy deletion, keeping the tree RAM-resident, etc.
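
To make the range-scan cost concrete, here is a minimal sketch of a scan along linked leaves; the Leaf layout and the starting leaf passed in are illustrative assumptions (a real scan would first descend from the root in O(log_d N)):

    from bisect import bisect_left, bisect_right

    class Leaf:
        """Toy leaf node: sorted keys, parallel values, link to the next leaf."""
        def __init__(self, keys, values, nxt=None):
            self.keys, self.values, self.next = keys, values, nxt

    def range_scan(start_leaf, lo, hi):
        """Yield (key, value) pairs with lo <= key <= hi by following leaf links: the O(k) part."""
        leaf = start_leaf
        while leaf is not None:
            begin = bisect_left(leaf.keys, lo)
            end = bisect_right(leaf.keys, hi)
            for i in range(begin, end):
                yield leaf.keys[i], leaf.values[i]
            if leaf.keys and leaf.keys[-1] > hi:
                break                     # passed the end of the requested range
            leaf = leaf.next

    # Example: three linked leaves, as in the figure above
    l3 = Leaf([822, 823, 825], ["d822", "d823", "d825"])
    l2 = Leaf([33, 57], ["d33", "d57"], l3)
    l1 = Leaf([1, 4, 29, 30], ["d1", "d4", "d29", "d30"], l2)
    print(list(range_scan(l1, 29, 57)))   # [(29, 'd29'), (30, 'd30'), (33, 'd33'), (57, 'd57')]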

Compressing Data in a B+ Tree
  How to use less space in the nodes?
  Compress all keys together: most space efficient, but reading 10 bytes requires uncompressing the whole node.
  Split the keys into blocks and compress each block: less space efficient, but faster for small reads.

BigTable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, OSDI 2006 http://labs.google.com/papers/bigtable.html

Motivation
  Lots of (semi-)structured data at Google:
    Web: contents, crawl metadata, links/anchors/PageRank, ...
    Per-user data: user preference settings, recent queries, search results, ...
    Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
  Scale is large:
    billions of URLs, many versions per page (~20K/version)
    hundreds of millions of users, thousands of queries/sec
    100TB+ of satellite image data
  Need data both for offline data processing and online serving.

Why not use a commercial DB?
  Scale is too large for most commercial databases.
  Even if it weren't, cost would be very high; building internally means the system can be applied across many projects for low incremental cost.
  Low-level storage optimizations help performance significantly, and are much harder to do when running on top of a database layer.
  Also fun and challenging to build large-scale systems :)

Goals
  Wide applicability: used by many Google products and projects; supports both throughput-oriented batch-processing jobs and latency-sensitive serving of data to end users. Often want to examine data changes over time, e.g., the contents of a web page over multiple crawls.
  Scalability: a handful to thousands of servers, hundreds of TB to PB.
  High performance: very high read/write rates (millions of ops per second), efficient scans over all or interesting subsets of the data.
  High availability: want access to the most current data at any time.

BigTable
  Distributed multi-level map with an interesting data model.
  Fault-tolerant, persistent.
  Scalable: thousands of servers, terabytes of in-memory data, petabytes of disk-based data, millions of reads/writes per second, efficient scans.
  Self-managing: servers can be added/removed dynamically; servers adjust to load imbalance.

Status
  Design and initial implementation started at the beginning of 2004.
  In production use or active development for many projects: Google Analytics, Personal Search History, the crawling/indexing pipeline, Google Maps/Google Earth, Blogger, ...
  ~100 BigTable cells; the largest cell manages ~200TB of data spread over several thousand machines (circa 2007).

Building Blocks of BigTable
  Distributed file system (GFS): stores persistent state.
  Scheduler (not published): schedules jobs onto machines; BigTable jobs run among all the other kinds of jobs.
  Lock service (Chubby): distributed lock manager; can also reliably hold small files with high availability; used for master election and location bootstrapping.
  Data processing (MapReduce): simplified large-scale data processing; often used to read/write BigTable data (not a building block of BigTable, but uses BigTable heavily).

Google File System (GFS)
  The master manages metadata; data transfers happen directly between clients and chunkservers.
  Files are broken into chunks (typically 64 MB); chunks are triplicated across three machines for safety.
  See the SOSP '03 paper at http://labs.google.com/papers/gfs.html
  (Figure: one GFS master coordinating several chunkservers, with clients talking to both.)

Chubby
  Distributed lock service with a file system for small files.
  Usually 5 servers running the Paxos algorithm: maintains consistency, fault-tolerant.
  Used for master election and as an event notification mechanism.
  Also used for name resolution in the cluster.

Key Jobs in a BigTable Cluster
  Master: schedules tablet assignments, quota management, health checks of tablet servers, garbage collection management.
  Tablet servers: serve data for reads and writes (one tablet is assigned to exactly one tablet server), compaction, replication, monitoring, etc.

Typical Cluster
  (Figure: a shared cluster running a cluster scheduling master, a lock service, and a GFS master, plus machines 1..N, each running Linux with a scheduler slave and a GFS chunkserver, and hosting user applications, BigTable servers, and the BigTable master side by side.)

BigTable Overview
  Data model
  Implementation structure: tablets, compactions, locality groups, ...
  API
  Details: shared logs, compression, replication, ...
  Current/future work

Basic Data Model
  Semi-structured: a multi-dimensional sparse map
    (row, column, timestamp) -> cell contents
  (Figure: the row "com.cnn.www" with columns "contents:" and "inlinks:", where the contents: cell holds "<html>..." at timestamps t3, t11, and t17.)
  A good match for most of Google's applications.
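
A minimal sketch of this data model as nested Python dictionaries; the class and method names are my own illustration, not Google's API:

    import time
    from collections import defaultdict

    class ToyTable:
        """(row, column, timestamp) -> value, with the newest versions first."""
        def __init__(self):
            # rows are sparse: a row stores only the columns that actually have data
            self.rows = defaultdict(lambda: defaultdict(list))

        def put(self, row, column, value, timestamp=None):
            ts = timestamp if timestamp is not None else time.time()
            versions = self.rows[row][column]
            versions.append((ts, value))
            versions.sort(key=lambda tv: -tv[0])     # keep newest first

        def get(self, row, column, num_versions=1):
            return self.rows[row][column][:num_versions]

    t = ToyTable()
    t.put("com.cnn.www", "contents:", "<html>v1</html>", timestamp=3)
    t.put("com.cnn.www", "contents:", "<html>v2</html>", timestamp=11)
    t.put("com.cnn.www", "anchor:cnn.com", "CNN", timestamp=17)
    print(t.get("com.cnn.www", "contents:"))         # newest contents: version only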

Rows
  Everything is a string. Every row has a single key: an arbitrary string (how about numerical keys?).
  Access to data in a row is atomic.
  Row creation is implicit upon storing data.
  Rows are ordered lexicographically by key, so rows that are close together lexicographically usually reside on one or a small number of machines.
  Question: key distribution? Hot rows?
  No such thing as an empty row (see the Columns slide).

Columns
  Arbitrary number of columns, organized into column families and then locality groups: data in the same locality group are stored together (more later).
  Columns are not predefined (compare: schema); this is a multi-map, not a table.
  Column names are arbitrary strings.
  Sparse: a row contains only the columns that have data.

Column Family
  Must be created before any column in the family can be written.
  Has a type: string, protocol buffer, ...
  Basic unit of access control and usage accounting: different applications need access to different column families; be careful with sensitive data.
  A column key is named as family:qualifier; family is a printable string, qualifier is any string.
  Usually not a lot of column families in a BigTable cluster (hundreds), but unlimited columns within each column family.
  Example: one "anchor:" column family for all anchors of incoming links, with columns anchor:cnn.com, anchor:news.yahoo.com, anchor:someone.blogger.com, ...

BigTable Operations
  Reading: selection by a combination of row, column, and timestamp ranges.
  Writing: write to individual cell versions (row, column, timestamp); delete at different granularities, up to a whole row.
  Writes are applied atomically within a row.

Read API
  Scanner: read arbitrary cells in a bigtable.
  Each row read is atomic.
  Can restrict the returned rows to a particular range.
  Can ask for data from just 1 row (Lookup), all rows, etc.
  Can ask for all columns, just certain column families, specific columns, or timestamp ranges (ScanStream).

    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
      printf("%s %s %lld %s\n",
             scanner.RowName(), stream->ColumnName(),
             stream->MicroTimestamp(), stream->Value());
    }

Write API
  Metadata operations: create/delete tables and column families, change metadata.
  Row mutation (Apply): single row only, atomic, a sequence of sets and deletes.
  APIs exist for bulk updates: updates are grouped and sent with one RPC call.

    Table *T = OpenOrDie("/bigtable/web/webtable");
    // Write a new anchor and delete an old anchor
    RowMutation r1(T, "com.cnn.www");
    r1.Set("anchor:www.c-span.org", "CNN");
    r1.Delete("anchor:www.abc.com");
    Operation op;
    Apply(&op, &r1);

Tablets
  Large tables are broken into tablets at row boundaries.
  A tablet holds a contiguous range of rows; clients can often choose row keys to achieve locality.
  Aim for ~100MB to 200MB of data per tablet.
  Each serving machine is responsible for ~100 tablets.
  Fast recovery: 100 machines each pick up 1 tablet from a failed machine.
  Fine-grained load balancing: migrate tablets away from an overloaded machine; the master makes load-balancing decisions.

Tablets (continued)
  Dynamic fragmentation of rows; the unit of load balancing.
  Distributed over tablet servers.
  Tablets split and merge automatically, based on size and load, or manually.
  Clients can choose row keys to achieve locality.

Tablets & Splitting
  (Figure: a table with column families "language:" and "contents:" whose rows, e.g. aaa.com, cnn.com, cnn.com/sports.html, ..., website.com, ..., yahoo.com/kids.html, yahoo.com/kids.html\0, ..., zuppa.com/menu.html, are split at row boundaries into tablets.)

Locality Groups
  Dynamic fragmentation of column families: segregates data within a tablet.
  Different locality groups go into different SSTable files on GFS, so scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table).
  Provides control over storage layout: memory mapping of locality groups, choice of compression algorithms, client-controlled block size.

Locality Groups (example)
  (Figure: the row www.cnn.com split across two locality groups, one holding the "contents:" column ("<html>...") and the other holding "language:" ("EN") and "pagerank:" (0.65).)

Timestamps
  Used to store different versions of data in a cell.
  New writes default to the current time, but timestamps for writes can also be set explicitly by clients.
  Lookup options: return the most recent K values, or return all values in a timestamp range (or all values).
  Column families can be marked with attributes: only retain the most recent K values in a cell, or keep values until they are older than K seconds.
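
A minimal sketch of how those two per-family attributes could prune old cell versions; the function name and the assumption that pruning happens when versions are scanned or compacted are mine:

    import time

    def gc_versions(versions, max_versions=None, max_age_seconds=None, now=None):
        """versions: list of (timestamp, value), newest first.
        Return the versions that survive the column family's retention attributes."""
        now = now if now is not None else time.time()
        kept = versions
        if max_versions is not None:
            kept = kept[:max_versions]                  # only retain the most recent K values
        if max_age_seconds is not None:
            kept = [(ts, v) for ts, v in kept if now - ts <= max_age_seconds]
        return kept

    versions = [(100.0, "v3"), (60.0, "v2"), (10.0, "v1")]          # newest first
    print(gc_versions(versions, max_versions=2, now=120.0))         # [(100.0, 'v3'), (60.0, 'v2')]
    print(gc_versions(versions, max_age_seconds=30, now=120.0))     # [(100.0, 'v3')]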

Where is my Tablet?
  Tablets move around from one tablet server to another (why?).
  Question: given a row, how does a client find the right tablet server?
  A tablet server location is ip:port; the client needs to find the tablet whose row range covers the target row.
  One approach: ask the BigTable master, but a central server would almost certainly be a bottleneck in a large system.
  Instead: store tablet location info in special tablets, similar to a B+ tree.

Metadata Tablets
  Approach: a 3-level, B+-tree-like scheme for tablet locations.
  1st level: Chubby, points to MD0 (the root metadata tablet).
  2nd level: MD0 data points to the appropriate METADATA tablet.
  3rd level: METADATA tablets point to data tablets.
  METADATA tablets can be split when necessary; MD0 never splits, so the number of levels is fixed.
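
A minimal sketch of the lookup; representing each level as a sorted list of (last row of range, location) pairs and the helper names are my own assumptions:

    import bisect

    def lookup_level(entries, row):
        """entries: sorted list of (last_row_in_range, value).
        Return the value of the first range whose last row is >= the target row."""
        keys = [last_row for last_row, _ in entries]
        return entries[bisect.bisect_left(keys, row)][1]

    def locate_tablet(md0, metadata_tablets, row):
        # level 1 (the Chubby file that points to MD0) is elided; start from MD0
        md_name = lookup_level(md0, row)                       # level 2: MD0 -> METADATA tablet
        return lookup_level(metadata_tablets[md_name], row)    # level 3: METADATA -> data tablet

    # toy data: MD0 covers two METADATA tablets, which cover four data tablets
    md0 = [("m", "MD-A"), ("\xff", "MD-B")]
    metadata_tablets = {
        "MD-A": [("d", "tabletserver1:8001"), ("m", "tabletserver2:8001")],
        "MD-B": [("t", "tabletserver3:8001"), ("\xff", "tabletserver1:8002")],
    }
    print(locate_tablet(md0, metadata_tablets, "com.cnn.www"))   # tabletserver1:8001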

Finding Tablet Location
  Clients cache tablet locations.
  If the cache is empty, a client has to make three network round trips; if the cache is stale, up to six round trips.
  Tablet locations are stored in memory, so no GFS accesses are required.

Tablet Storage
  Commit log on GFS; the redo log is buffered in the tablet server's memory.
  A set of locality groups: one locality group = a set of SSTable files on GFS.
  key = <row, column, timestamp>, value = cell contents.

SSTable
  SSTable: string-to-string table; a persistent, ordered, immutable map from keys to values, where keys and values are arbitrary byte strings.
  Contains a sequence of blocks (typical size = 64KB), with a block index at the end of the SSTable that is loaded at open time; one disk seek per block read.
  Operations: lookup(key), iterate(key_range).
  An SSTable can be mapped into memory.
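
A minimal in-memory sketch of using the block index to answer lookup(key) with a single block read; the class layout is my own simplification (real SSTables keep the blocks on disk on GFS):

    import bisect

    class ToySSTable:
        """Immutable sorted map split into fixed-size blocks, plus an index of the
        last key in each block, mimicking the 'one disk seek per block read' design."""
        def __init__(self, sorted_items, block_size=4):
            self.blocks, self.index = [], []
            for i in range(0, len(sorted_items), block_size):
                block = sorted_items[i:i + block_size]
                self.blocks.append(block)
                self.index.append(block[-1][0])          # last key in this block

        def lookup(self, key):
            b = bisect.bisect_left(self.index, key)      # which block could hold the key?
            if b == len(self.blocks):
                return None
            block = self.blocks[b]                       # in a real SSTable: one disk seek + read
            keys = [k for k, _ in block]
            j = bisect.bisect_left(keys, key)
            if j < len(block) and block[j][0] == key:
                return block[j][1]
            return None

    items = sorted((f"row{i:03d}", f"value{i}") for i in range(10))
    sst = ToySSTable(items, block_size=4)
    print(sst.lookup("row007"))    # value7
    print(sst.lookup("row999"))    # None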

Tablet Serving
  (Figure: writes go to an append-only commit log on GFS and to an in-memory, random-access memtable; minor compactions flush the memtable to SSTables on GFS; reads merge the memtable with the SSTables.)
  SSTable: immutable on-disk ordered map from string to string; the string keys are <row, column, timestamp> triples.

Compactions
  Tablet state is represented as a set of immutable, compacted SSTable files plus a tail of the log (buffered in memory).
  Minor compaction: when the in-memory state fills up, pick the tablet with the most data and write its contents to SSTables stored in GFS.
  Major compaction: periodically compact all SSTables for a tablet into one new base SSTable on GFS; storage is reclaimed from deletions at this point (garbage collection).
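
A minimal sketch of why a major compaction reclaims deleted data; modeling deletions as a TOMBSTONE marker and treating each layer as a plain dict are my own simplifications:

    TOMBSTONE = object()    # assumed deletion marker, for illustration only

    def major_compact(memtable, sstables):
        """memtable and each sstable are dicts key -> value, listed newest first.
        Return the single merged base SSTable, with deletions reclaimed."""
        merged = {}
        for layer in [memtable] + sstables:               # the newest layer wins
            for key, value in layer.items():
                merged.setdefault(key, value)
        return {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}

    memtable = {"rowB": TOMBSTONE}                         # rowB was deleted recently
    sstables = [{"rowA": "a2", "rowB": "b1"},              # newer SSTable
                {"rowA": "a1", "rowC": "c1"}]              # older SSTable
    print(major_compact(memtable, sstables))               # {'rowA': 'a2', 'rowC': 'c1'}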

System Structure
  (Figure: a Bigtable cell. Clients use the Bigtable client library; metadata ops and load balancing go through the Bigtable master; Open() and read/write requests go to the Bigtable tablet servers, which serve the data. Underneath, the cluster scheduling system handles failover and monitoring, GFS holds tablet data and logs, and the lock service holds metadata and handles master election.)

File Cleaning
  BigTable generates a lot of files, dominated by SSTables.
  SSTables are immutable: they can be created, read, or deleted, but not overwritten.
  Obsolete SSTables are deleted in a mark-and-sweep garbage collection run by the BigTable master.

Chubby Interactions
  Master election: a single Chubby lock.
  Tablet server membership: a tablet server creates and acquires an exclusive lock on a uniquely-named file in the "servers" directory of Chubby when it starts, and stops serving when the lock is lost; the master monitors the directory to find tablet servers.
  Chubby also stores the access control list and metadata: schema information (column family metadata), tablet advertisement and metadata, replication metadata.

Shared Logs
  Designed for 1M tablets and 1000s of tablet servers, but 1M logs being written simultaneously performs badly.
  Solution: shared logs. Write one log file per tablet server instead of per tablet; updates for many tablets are co-mingled in the same file; start a new log chunk every so often (64 MB).
  Problem: during recovery, a server needs to read log data to apply mutations for a tablet; lots of wasted I/O if many machines need to read data for many tablets from the same log chunk.

Shared Log Recovery
  Recovering servers inform the master of the log chunks they need to read.
  The master aggregates the requests and orchestrates sorting of the needed chunks, assigning log chunks to different tablet servers for sorting.
  Those servers sort the chunks by tablet and write the sorted data to local disk.
  Other tablet servers then ask the master which servers have the sorted chunks they need, and issue direct RPCs to those peer tablet servers to read the sorted data for their tablets.
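
A minimal sketch of the central step, sorting a co-mingled log chunk by tablet so each recovering tablet can be replayed sequentially; the log record layout (tablet id, sequence number, mutation) is a made-up illustration:

    from collections import defaultdict

    def sort_chunk_by_tablet(log_records):
        """log_records: co-mingled list of (tablet_id, sequence_number, mutation).
        Return tablet_id -> mutations in sequence order."""
        by_tablet = defaultdict(list)
        for tablet_id, seq, mutation in log_records:
            by_tablet[tablet_id].append((seq, mutation))
        return {t: [m for _, m in sorted(entries)] for t, entries in by_tablet.items()}

    chunk = [("tablet-7", 2, "set rowX"), ("tablet-3", 1, "del rowY"),
             ("tablet-7", 1, "set rowW"), ("tablet-3", 2, "set rowZ")]
    print(sort_chunk_by_tablet(chunk))
    # {'tablet-7': ['set rowW', 'set rowX'], 'tablet-3': ['del rowY', 'set rowZ']}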

BigTable Compression
  Keys: sorted strings of (row, column, timestamp): prefix compression.
  Values: group values together by type (e.g., by column family name); run BMDiff across all values in one family; the BMDiff output for values 1..N is the dictionary for value N+1.
  Zippy as a final pass over the whole block: catches more localized repetitions, and also catches cross-column-family repetition and compresses the keys.
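
Since sorted keys share long prefixes, here is a minimal sketch of prefix compression (my own illustration): each key is stored as the number of bytes shared with the previous key plus the remaining suffix:

    def prefix_compress(sorted_keys):
        out, prev = [], ""
        for key in sorted_keys:
            shared = 0
            while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
                shared += 1
            out.append((shared, key[shared:]))   # (bytes shared with the previous key, suffix)
            prev = key
        return out

    def prefix_decompress(compressed):
        keys, prev = [], ""
        for shared, suffix in compressed:
            prev = prev[:shared] + suffix
            keys.append(prev)
        return keys

    keys = ["com.cnn.www/index.html", "com.cnn.www/sports.html", "com.cnn.www/world.html"]
    compressed = prefix_compress(keys)
    print(compressed)   # [(0, 'com.cnn.www/index.html'), (12, 'sports.html'), (12, 'world.html')]
    assert prefix_decompress(compressed) == keys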

Compression
  Many opportunities for compression: similar values in the same row/column at different timestamps, similar values in different columns, similar values across adjacent rows.
  Within each SSTable for a locality group, encode compressed blocks; keep blocks small for random access (~64KB of compressed data).
  Exploit the fact that many values are very similar; must be low CPU cost for encoding/decoding.
  Two building blocks: BMDiff and Zippy.

BMDiff
  Bentley & McIlroy, DCC '99: "Data Compression Using Long Common Strings".
  Input: dictionary + source. Output: a sequence of COPY (<x> bytes from offset <y>) and LITERAL (<literal text>) operations.
  Store a hash at every 32-byte aligned boundary in the dictionary and in the source processed so far.
  For every new source byte: compute an incremental hash of the last 32 bytes and look it up in the hash table; on a hit, expand the match forwards and backwards and emit a COPY.
  Encode: ~100 MB/s, decode: ~1000 MB/s.
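
A heavily simplified sketch of the long-common-strings idea (my own: it stores raw 32-byte blocks instead of an incremental hash, compresses a stream against itself rather than a separate dictionary, and ignores the encoding of tags and lengths):

    BLOCK = 32   # remember content at every 32-byte aligned boundary, as in Bentley-McIlroy

    def bm_compress(data: bytes):
        table = {}                        # block content -> offset where it was first seen
        ops, i, lit_start = [], 0, 0
        while i + BLOCK <= len(data):
            block = data[i:i + BLOCK]
            if block in table:
                if lit_start < i:                         # flush pending literal bytes
                    ops.append(("LITERAL", data[lit_start:i]))
                src, length = table[block], BLOCK
                while i + length < len(data) and data[src + length] == data[i + length]:
                    length += 1                           # extend the match forwards
                ops.append(("COPY", src, length))
                i += length
                lit_start = i
            else:
                if i % BLOCK == 0:                        # only remember aligned blocks
                    table[block] = i
                i += 1
        if lit_start < len(data):
            ops.append(("LITERAL", data[lit_start:]))
        return ops

    def bm_decompress(ops):
        out = bytearray()
        for op in ops:
            if op[0] == "LITERAL":
                out += op[1]
            else:                                         # COPY: byte-by-byte to allow overlap
                _, src, length = op
                for k in range(length):
                    out.append(out[src + k])
        return bytes(out)

    data = b"the quick brown fox jumps over the lazy dog. " * 4
    ops = bm_compress(data)
    assert bm_decompress(ops) == data
    print(ops)   # one LITERAL for the first sentence, then one long COPY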

Zippy
  LZW-like: store a hash of the last four bytes in a 16K-entry table; for every input byte, compute the hash of the last four bytes, look it up in the table, and emit a COPY or a LITERAL.
  Differences from BMDiff: much smaller compression window (local repetitions), the hash table is not associative, careful encoding of COPY/LITERAL tags and lengths.
  Sloppy but fast:
    Algorithm | % remaining | Encoding | Decoding
    Gzip      | 13.4%       | 21 MB/s  | 118 MB/s
    LZO       | 20.5%       | 135 MB/s | 410 MB/s
    Zippy     | 22.2%       | 172 MB/s | 409 MB/s

Compression Effectiveness
  Experiment: store the contents of a 2.1B-page crawl in a BigTable instance.
  Key: URL rearranged as com.cnn.www/index.html:http
    groups pages from the same site together: good for compression, and good for clients (efficient to scan over all pages on a web site).
  One compression strategy: gzip each page: ~28% of bytes remaining.
  BigTable with BMDiff + Zippy:
    Type              | Count (B) | Space (TB) | Compressed (TB) | % remaining
    Web page contents | 2.1       | 45.1       | 4.2             | 9.2%
    Links             | 1.8       | 11.2       | 1.6             | 13.9%
    Anchors           | 126.3     | 22.8       | 2.9             | 12.7%
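
A minimal sketch of that key transformation (the function name is mine; the rearranged form matches the slide's example):

    def url_to_row_key(url: str) -> str:
        """scheme://host/path -> reversed-host/path:scheme,
        e.g. http://www.cnn.com/index.html -> com.cnn.www/index.html:http"""
        scheme, rest = url.split("://", 1)
        host, _, path = rest.partition("/")
        reversed_host = ".".join(reversed(host.split(".")))
        return f"{reversed_host}/{path}:{scheme}"

    print(url_to_row_key("http://www.cnn.com/index.html"))    # com.cnn.www/index.html:http
    print(url_to_row_key("http://www.cnn.com/sports.html"))   # com.cnn.www/sports.html:http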

Bloom Filters
  A read may need to consult many SSTables. Idea: use a membership test to avoid disk reads for non-existing data.
  Membership test: does (row, column) exist in the tablet? Algorithm: Bloom filter.
  No false negatives; false positives only cause an extra read to find out.
  Update the bit vector when new data is inserted. Delete?
  (Figure: the data in the set {a_1, a_2, ..., a_N} is hashed by k independent hash functions h_1, ..., h_k into a bit vector of m positions; a query element b is checked against the same positions.)
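
A minimal Bloom filter sketch; deriving the k positions from two hashlib digests (double hashing) is an implementation choice of this illustration, not BigTable's:

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=4):
            self.m, self.k = m, k
            self.bits = bytearray(m)          # one byte per bit position, for simplicity

        def _positions(self, item: str):
            h1 = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")
            h2 = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item: str):
            for p in self._positions(item):
                self.bits[p] = 1

        def might_contain(self, item: str) -> bool:
            # False => definitely absent (no false negatives); True => probably present
            return all(self.bits[p] for p in self._positions(item))

    bf = BloomFilter()
    bf.add("com.cnn.www|contents:")
    print(bf.might_contain("com.cnn.www|contents:"))   # True
    print(bf.might_contain("com.cnn.www|anchor:x"))    # almost certainly False: skip the disk read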

Replication
  Often want updates replicated to many BigTable cells in different datacenters: low-latency access from anywhere in the world, and disaster tolerance.
  Optimistic replication scheme: writes in any of the on-line replicas are eventually propagated to the other replica clusters; 99.9% of writes are replicated immediately (at the speed of light).
  Currently a thin layer above the BigTable client library; working to move the support inside the BigTable system.
  Replication is deployed on My Search History.

Performance
  (Performance figures from the original slides were not transcribed.)

Application: Personalized Search
  Personalized Search (http://www.google.com/psearch): an opt-in service that records a user's queries and clicks across Google (web search, image search, news, etc.); the user can edit the search history, and the search history affects search results.
  Implementation in BigTable: one user per row, row name = user ID; one column family per action; analyzed with MapReduce to produce the user profile.
  Other products added column families later; hence the quota system.

Sample Usages
  (Table of sample applications from the original slides was not transcribed.)

In Development / Future Plans
  More expressive data manipulation/access: allow sending small scripts to perform read/modify/write transactions so that they execute on the server (a kind of stored procedure).
  Multi-row (i.e., distributed) transaction support.
  General performance work for very large cells.
  BigTable as a service: interesting issues of resource fairness, performance isolation, prioritization, etc., across different clients (e.g., App Engine's DataStore).

Conclusions
  The data model is applicable to a broad range of clients and is actively deployed in many of Google's services.
  The system provides high-performance storage on a large scale: self-managing, thousands of servers, millions of ops/second, multiple GB/s of reading and writing.