Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014
Scale, Security, Schema
Scale
to scale 1 - (vt) to change the size of something
let's scale the cluster up to twice the original size
to scale 2 - (vi) to function properly at a large scale
Accumulo scales
What is Large Scale?
Notebook Computer 16 GB DRAM 512 GB Flash Storage 2.3 GHz quad-core i7 CPU
Modern Server 100s of GB DRAM 10s of TB on disk 10s of cores
Large Scale
          Laptop   Server   10 Nodes   100 Nodes   1000 Nodes   10,000 Nodes
In RAM    10 GB    100 GB   1 TB       10 TB       100 TB       1 PB
On Disk   1 TB     10 TB    100 TB     1 PB        10 PB        100 PB
Data Composition (chart): monthly volumes of original raw data and derivative data (QFDs, indexes), January through April
Accumulo Scales From GB to PB, Accumulo keeps two things low: Administrative effort Scan latency
Scan Latency (chart): latency stays roughly flat (under 0.05 s) as the cluster grows from 0 to 1000 nodes
Administrative Overhead (chart): failed machines vs. required admin interventions as the cluster grows from 0 to 1000 nodes
Accumulo Scales From GB to PB three things grow linearly: Total storage size Ingest Rate Concurrent scans
Ingest Benchmark (chart): ingest rate grows linearly with cluster size, reaching roughly 100 million entries per second at 1000 nodes
AWB Benchmark http://sqrrl.com/media/accumulo-benchmark-10312013-1.pdf
1000 machines
100 M entries written per second
408 terabytes
7.56 trillion total entries
Graph Benchmark http://www.pdl.cmu.edu/sdi/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
1200 machines
4.4 trillion vertices
70.4 trillion edges
149 M edges traversed per second
1 petabyte
Graph Analysis, in billions of edges: Twitter 1.5, Yahoo! 6.6, Facebook 1,000, Accumulo 70,000
Accumulo is designed after Google's BigTable
BigTable powers hundreds of applications at Google
BigTable serves 2+ exabytes http://hbasecon.com/sessions/#session33
600 M queries per second organization wide
From 10 to 10,000
Starting with ten machines: 10^1
One rack
1 TB RAM
10-100 TB Disk
Hardware failures rare
Test Application Designs
Designing Applications for Scale
Keys to Scaling: 1. Live writes go to all servers. 2. User requests are satisfied by few scans. 3. Updates are turned into inserts.
Keys to Scaling Writes on all servers Few Scans
Hash / UUID Keys
Logical keys:          usera:name=Bob, usera:age=43, usera:account=$30,
                       userb:name=Annie, userb:age=32, userb:account=$25,
                       userc:name=Joe, userc:age=59
Hashed RowID -> Value: af362de4=Bob, b23dc4be=Annie, b98de2ff=Joe,
                       c48e2ade=$30, c7e43fb2=$25, d938ff3d=32,
                       e2e4dac4=59, e98f2eab3=43
Uniform writes
Monitor - Participating Tablet Servers (MyTable)
Server   Hosted Tablets   Ingest
r1n1     1500             200k
r1n2     1501             210k
r2n1     1499             190k
r2n2     1500             200k
Hash / UUID Keys: get(usera) must fetch usera's entries from scattered row IDs (af362de4=Bob, c48e2ade=$30, e98f2eab3=43). 3 x 1-entry scans on 3 servers.
Keys to Scaling Writes on all servers Few Scans Hash / UUID Keys
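The hash-key technique can be sketched in a few lines. This is a hypothetical helper, not part of the Accumulo API; MD5 and the 8-character truncation are assumptions chosen to mirror the slides' example IDs:

```python
import hashlib

def hashed_row_id(natural_key: str, length: int = 8) -> str:
    # Hash the natural key so row IDs distribute uniformly across the
    # sorted keyspace, and therefore across tablet servers.
    return hashlib.md5(natural_key.encode("utf-8")).hexdigest()[:length]

# Sequential user keys map to scattered row IDs:
rows = [hashed_row_id(k) for k in ("usera", "userb", "userc")]
```

The trade-off, shown on the surrounding slides, is that reads for one user now require one scan per scattered entry.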
Group for Locality
Logical keys: usera (name Bob, age 43, account $30), userb (name Annie, age 32, account $25), userc (name Joe, age 59), userd (name Fred, age 29)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       32
af362de4   account   $25
c48e2ade   name      Joe
c48e2ade   age       59
e2e4dac4   name      Bob
e2e4dac4   age       43
e2e4dac4   account   $30
Still fairly uniform writes
Group for Locality: get(usera) returns one contiguous row:
RowID      Col       Value
af362de4   name      Annie
af362de4   age       32
af362de4   account   $25
1 x 3-entry scan on 1 server
Keys to Scaling Writes on all servers Few Scans Grouped Keys
Temporal Keys (build sequence): as dated entries arrive in order, every new key sorts to the end of the table:
RowID -> Value: 20140101=44, 20140102=22, 20140103=23, 20140104=25, 20140105=31, 20140106=27, 20140107=25, 20140108=17
Always write to one server
No write parallelism
Temporal Keys: get(20140101 to 201404) returns 20140101=44 through 20140108=17 as one contiguous range. Fetching ranges uses few scans.
Keys to Scaling Writes on all servers Few Scans Temporal Keys
Binned Temporal Keys (build sequence): prefix each date key with a bin ID so successive days round-robin across three bins:
RowID -> Value: 0_20140101=44, 0_20140104=25, 0_20140107=25, 1_20140102=22, 1_20140105=31, 1_20140108=17, 2_20140103=23, 2_20140106=27
Uniform writes
Binned Temporal Keys: get(20140101 to 201404) becomes one range scan per bin (0_, 1_, 2_ prefixes). One scan per bin.
Keys to Scaling Writes on all servers Few Scans Binned Temporal Keys
Keys to Scaling Key design is critical Group data under common row IDs to reduce scans Prepend bins to row IDs to increase write parallelism
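The binning idea can be sketched as follows. The bin assignment (date modulo bin count) is an assumption that happens to reproduce the round-robin layout on the slides; any deterministic, uniform assignment works:

```python
def binned_row_id(date_str: str, num_bins: int = 3) -> str:
    # Prefix the date with a bin ID so consecutive days land in
    # different key ranges, and thus on different servers.
    return f"{int(date_str) % num_bins}_{date_str}"

def bin_ranges(start: str, end: str, num_bins: int = 3):
    # A single logical date range becomes one scan range per bin.
    return [(f"{b}_{start}", f"{b}_{end}") for b in range(num_bins)]
```

Write parallelism scales with the number of bins, while reads pay a fixed cost of one scan per bin.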
Splits: pre-split, or let the table split organically. Going from dev to production, ingest a representative sample, obtain its split points, and use them to pre-split the larger system. Hundreds or thousands of tablets per server are OK. Want at least one tablet per server.
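Deriving pre-split points from a sample can be sketched like this. The helper is illustrative only; Accumulo's shell and client API have their own mechanisms for getting and setting table splits:

```python
def split_points(sample_keys, num_servers, tablets_per_server=1):
    # Pick evenly spaced keys from a sorted sample to use as
    # pre-split points for the production table.
    keys = sorted(sample_keys)
    n_splits = num_servers * tablets_per_server - 1
    step = len(keys) / (n_splits + 1)
    return [keys[int(step * i)] for i in range(1, n_splits + 1)]

# 100 sampled keys, 4 servers -> 3 split points at the quartiles
points = split_points([f"{i:03d}" for i in range(100)], num_servers=4)
```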
Effect of Compression: similar sorted keys compress well, so you may need more data than you think to trigger auto-splits.
Inserts are fast 10s of thousands per second per machine
Updates *can* be
Update Types Overwrite Combine Complex
Update - Overwrite: performance is the same as insert. Ignore (don't read) the existing value; Accumulo's VersioningIterator does the overwrite.
Update - Overwrite: userb:age -> 34 (before)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       32
af362de4   account   $25
c48e2ade   name      Joe
c48e2ade   age       59
e2e4dac4   name      Bob
e2e4dac4   age       43
e2e4dac4   account   $30
Update - Overwrite: userb:age -> 34 (after)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       34
af362de4   account   $25
c48e2ade   name      Joe
c48e2ade   age       59
e2e4dac4   name      Bob
e2e4dac4   age       43
e2e4dac4   account   $30
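What the VersioningIterator effectively does here (with max versions = 1) can be emulated in a few lines. This is a toy model in Python, not Accumulo code:

```python
def latest_versions(entries):
    # entries: (row, col, timestamp, value) tuples.
    # Keep only the newest value per (row, col), so an overwrite
    # is just a new insert with a later timestamp.
    latest = {}
    for row, col, ts, value in entries:
        if (row, col) not in latest or ts > latest[(row, col)][0]:
            latest[(row, col)] = (ts, value)
    return {key: value for key, (ts, value) in latest.items()}

latest_versions([("af362de4", "age", 1, "32"),
                 ("af362de4", "age", 2, "34")])
# → {("af362de4", "age"): "34"}
```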
Update - Combine: things like X = X + 1. Normally one would have to read the old value to do this, but Accumulo Iterators allow multiple inserts to be combined at scan time or at compaction. Performance is the same as inserts.
Update - Combine: userb:account -> +10 (before)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       34
af362de4   account   $25
(other rows unchanged)
Update - Combine: the +10 is written as a new insert alongside the old value
RowID      Col       Value
af362de4   account   $25
af362de4   account   $10
(other rows unchanged)
Update - Combine: getaccount(userb) combines the two entries at scan time: $25 + $10 = $35
Update - Combine: after compaction a single entry remains: af362de4 account $35
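The combine-at-scan-time behavior (as a summing combiner would provide) can be modeled like this. Again a toy model, not Accumulo code:

```python
from collections import defaultdict

def combined_scan(entries):
    # Multiple inserts for the same (row, col) are summed when read,
    # so an "update" is just another insert; no read is needed at
    # write time.
    totals = defaultdict(int)
    for row, col, value in entries:
        totals[(row, col)] += value
    return dict(totals)

combined_scan([("af362de4", "account", 25),
               ("af362de4", "account", 10)])
# → {("af362de4", "account"): 35}
```

Compaction applies the same function permanently, collapsing the partial entries into one.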
Update - Complex: some updates require looking at more data than Iterators have access to, such as multiple rows. These require reading the data out in order to write the new value. Performance will be much slower.
Update - Complex: userc:account = getbalance(usera) + getbalance(userb) = $35 + $30 = $65 (before)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       34
af362de4   account   $35
c48e2ade   name      Joe
c48e2ade   age       59
c48e2ade   account   $40
e2e4dac4   name      Bob
e2e4dac4   age       43
e2e4dac4   account   $30
Update - Complex: userc:account = getbalance(usera) + getbalance(userb) = $35 + $30 = $65 (after)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       34
af362de4   account   $35
c48e2ade   name      Joe
c48e2ade   age       59
c48e2ade   account   $65
e2e4dac4   name      Bob
e2e4dac4   age       43
e2e4dac4   account   $30
Planning a Larger-Scale Cluster: 10^2 to 10^4
Storage vs Ingest (log-log chart): ingest rate in millions of entries per second vs. storage in terabytes, for 1x1 TB and 12x3 TB disk configurations
Model for Ingest Rates
N - number of machines
S - single-server throughput (entries/second)
A - aggregate cluster throughput (entries/second)
A = 0.85^(log2 N) * N * S
Each doubling of the cluster retains 85% efficiency, multiplying the write rate by 2 x 0.85 = 1.7
Estimating Machines Required
N - number of machines
S - single-server throughput (entries/second)
A - target aggregate throughput (entries/second)
N = 2^(log2(A/S) / 0.7655347), where 0.7655347 = log2(1.7)
Each doubling of the cluster retains 85% efficiency, multiplying the write rate by 1.7
Predicted Cluster Sizes (chart): number of machines required (0 to 12,000) vs. target throughput in millions of entries per second (0 to 600)
100 Machines: 10^2
Multiple racks
10 TB RAM
100 TB - 1PB Disk
Some hardware failures in the first week (burn in)
Expect 3 failed HDs in first 3 mo
Another 4 within the first year http://static.googleusercontent.com/media/research.google.com/en/us/archive/disk_failures.pdf
Can process the 1000 Genomes data set 260 TB www.1000genomes.org
Can store and index the Common Crawl Corpus: 2.8 billion web pages, 541 TB commoncrawl.org
One year of Twitter: 182 billion tweets, 483 TB http://www.sec.gov/archives/edgar/data/1418091/000119312513390321/d564001ds1.htm
Deploying an Application Users Clients Tablet Servers
May not see the effect of writing to disk for a while
1000 machines: 10^3
Multiple rows of racks
100 TB RAM
1-10 PB Disk
Hardware failure is a regular occurrence
Hard drive failure about every 5 days on average, skewed towards the beginning of the year
Can traverse the brain graph 70 trillion edges, 1 PB http://www.pdl.cmu.edu/sdi/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Facebook Graph 1s of PB http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_dhrubaborthakur.pdf
Netflix Video Master Copies 3.14 PB http://www.businessweek.com/articles/2013-05-09/netflix-reedhastings-survive-missteps-to-join-silicon-valleys-elite
World of Warcraft Backend Storage 1.3 PB http://www.datacenterknowledge.com/archives/2009/11/25/wows-back-end-10-data-centers-75000-cores/
Webpages, live on the Internet 14.3 Trillion http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html
Things like the difference between two compression algorithms start to make a big difference
Use range compactions to effect changes on portions of a table
Lay off ZooKeeper
Watch Garbage Collector and Namenode ops
Garbage Collection > 5 minutes?
Start thinking about NameNode Federation
Accumulo 1.6
Multiple NameNodes: Accumulo running over multiple HDFS clusters, each with its own NameNode and DataNodes
Multiple NameNodes: multiple NameNodes sharing one set of DataNodes (HDFS Federation; requires Hadoop 2.0)
More NameNodes = higher risk of one going down. Can use HA NameNodes in conjunction with Federation.
10,000 machines: 10^4
You, my friend, are here to kick a** and chew bubble gum
1 PB RAM
10-100 PB Disk
1 hardware failure every hour on average
Entire Internet Archive 15 PB http://www.motherjones.com/media/2014/05/internet-archive-wayback-machine-brewster-kahle
A year s worth of data from the Large Hadron Collider 15 PB http://home.web.cern.ch/about/computing
0.1% of all Internet traffic in 2013 43.6 PB http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html
Facebook Messaging Data 10s of PB http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_dhrubaborthakur.pdf
Facebook Photos 240 billion, high 10s of PB http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_dhrubaborthakur.pdf
Must use multiple NameNodes
Can tune back heartbeats, periodicity of central processes in general
Can combine multiple PB data sets
Up to 10 quadrillion entries in a single table
While maintaining sub-second lookup times
Only with Accumulo 1.6
Dealing with data over time
Data Over Time - Patterns: Initial Load, Increasing Velocity, Focus on Recency, Historical Summaries
Initial Load Get a pile of old data into Accumulo fast Latency not important (data is old) Throughput critical
Bulk Load RFiles
Bulk Loading: MapReduce writes RFiles, which are imported into Accumulo
Increasing velocity
If your data isn't big today, wait a little while
Accumulo scales up dynamically, online. No downtime
The first sense of "scale": changing the size
Scaling Up (Clients / Accumulo / HDFS):
1. Start with 3 physical servers, each running a Tablet Server process and a DataNode process.
2. Start 3 new Tablet Server processes and 3 new DataNode processes.
3. The Master immediately assigns tablets to the new servers.
4. Clients immediately begin querying the new Tablet Servers.
5. New Tablet Servers read existing data from the old DataNodes.
6. New Tablet Servers write new data to the new DataNodes.
Never really seen anyone do this
Except myself
20 machines in Amazon EC2
to 400 machines
all during the same MapReduce job reading data out of Accumulo, summarizing, and writing back
Scaled back down to 20 machines when done
Just killed Tablet Servers
Decommissioned Data Nodes for safe data consolidation to remaining 20 nodes
Other ways to go from 10^x to 10^(x+1)
Accumulo Table Export
followed by HDFS DistCP to new cluster
Maybe new replication feature
Newer Data is Read more Often
Accumulo keeps newly written data in memory
Block Cache can keep recently queried data in memory
Combining Iterators make maintaining summaries of large amounts of raw events easy
Reduces storage burden
Historical Summaries (chart): unique entities stored vs. raw events processed, April through July
Age-off iterator can automatically remove data over a certain age
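The age-off behavior can be sketched as follows. This is a toy model in Python, not the actual iterator, and the tuple layout is an assumption for illustration:

```python
import time

def age_off(entries, max_age_seconds, now=None):
    # entries: (key, timestamp, value) tuples. Drop anything older
    # than max_age_seconds, as an age-off filter would do at scan
    # and compaction time.
    now = time.time() if now is None else now
    return [(k, ts, v) for (k, ts, v) in entries if now - ts <= max_age_seconds]

age_off([("old", 0, "x"), ("new", 95, "y")], max_age_seconds=10, now=100)
# → [("new", 95, "y")]
```

Because expired entries are simply filtered rather than rewritten, removal costs nothing at write time.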
IBM estimates 2.5 exabytes of data is created every day http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
90% of available data created in last 2 years http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
25 new 10k node Accumulo clusters per day
Accumulo is doing its part to get in front of the big data trend
Questions?
@aaroncordova