Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014
Scale, Security, Schema
Scale
to scale 1 - (vt) to change the size of something
let's scale the cluster up to twice the original size
to scale 2 - (vi) to function properly at a large scale
Accumulo scales
What is Large Scale?
Notebook Computer 16 GB DRAM 512 GB Flash Storage 2.3 GHz quad-core i7 CPU
Modern Server 100s of GB DRAM 10s of TB on disk 10s of cores
Large Scale
          Laptop   Server   10 Nodes   100 Nodes   1000 Nodes   10,000 Nodes
In RAM    10 GB    100 GB   1 TB       10 TB       100 TB       1 PB
On Disk   1 TB     10 TB    100 TB     1 PB        10 PB        100 PB
Data Composition (chart): monthly volumes of original raw data and derivative data (QFDs, indexes), January through April
Accumulo Scales From GB to PB, Accumulo keeps two things low: Administrative effort Scan latency
Scan Latency (chart): latency stays roughly flat (under 0.05 s) as the cluster grows from 0 to 1000 nodes
Administrative Overhead (chart): failed machines vs. required admin interventions as the cluster grows from 0 to 1000 nodes
Accumulo Scales From GB to PB three things grow linearly: Total storage size Ingest Rate Concurrent scans
Ingest Benchmark (chart): ingest rate grows linearly with cluster size, reaching roughly 100 million entries per second at 1000 nodes
AWB Benchmark http://sqrrl.com/media/accumulo-benchmark-10312013-1.pdf
1000 machines
100 M entries written per second
408 terabytes
7.56 trillion total entries
Graph Benchmark http://www.pdl.cmu.edu/sdi/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
1200 machines
4.4 trillion vertices
70.4 trillion edges
149 M edges traversed per second
1 petabyte
Graph Analysis, in billions of edges: Twitter 1.5, Yahoo! 6.6, Facebook 1,000, Accumulo 70,000
Accumulo is designed after Google's BigTable
BigTable powers hundreds of applications at Google
BigTable serves 2+ exabytes http://hbasecon.com/sessions/#session33
600 M queries per second organization wide
From 10 to 10,000
Starting with ten machines: 10^1
One rack
1 TB RAM
10-100 TB Disk
Hardware failures rare
Test Application Designs
Designing Applications for Scale
Keys to Scaling: 1. Live writes go to all servers. 2. User requests are satisfied by few scans. 3. Updates are turned into inserts.
Keys to Scaling Writes on all servers Few Scans
Hash / UUID Keys
Logical keys:          usera:name=Bob, usera:age=43, usera:account=$30,
                       userb:name=Annie, userb:age=32, userb:account=$25,
                       userc:name=Joe, userc:age=59
Hashed RowID -> Value: af362de4=Bob, b23dc4be=Annie, b98de2ff=Joe,
                       c48e2ade=$30, c7e43fb2=$25, d938ff3d=32,
                       e2e4dac4=59, e98f2eab3=43
Uniform writes
Monitor - Participating Tablet Servers (MyTable)
Server   Hosted Tablets   Ingest
r1n1     1500             200k
r1n2     1501             210k
r2n1     1499             190k
r2n2     1500             200k
Hash / UUID Keys: get(usera) must fetch usera's entries from scattered row IDs (af362de4=Bob, c48e2ade=$30, e98f2eab3=43). 3 x 1-entry scans on 3 servers.
Keys to Scaling Writes on all servers Few Scans Hash / UUID Keys
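The hash-key technique can be sketched in a few lines. This is a hypothetical helper, not part of the Accumulo API; MD5 and the 8-character truncation are assumptions chosen to mirror the slides' example IDs:

```python
import hashlib

def hashed_row_id(natural_key: str, length: int = 8) -> str:
    # Hash the natural key so row IDs distribute uniformly across the
    # sorted keyspace, and therefore across tablet servers.
    return hashlib.md5(natural_key.encode("utf-8")).hexdigest()[:length]

# Sequential user keys map to scattered row IDs:
rows = [hashed_row_id(k) for k in ("usera", "userb", "userc")]
```

The trade-off, shown on the surrounding slides, is that reads for one user now require one scan per scattered entry.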
Group for Locality
Logical keys: usera (name Bob, age 43, account $30), userb (name Annie, age 32, account $25), userc (name Joe, age 59), userd (name Fred, age 29)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       32
af362de4   account   $25
c48e2ade   name      Joe
c48e2ade   age       59
e2e4dac4   name      Bob
e2e4dac4   age       43
e2e4dac4   account   $30
Still fairly uniform writes
Group for Locality: get(usera) returns one contiguous row:
RowID      Col       Value
af362de4   name      Annie
af362de4   age       32
af362de4   account   $25
1 x 3-entry scan on 1 server
Keys to Scaling Writes on all servers Few Scans Grouped Keys
Temporal Keys (build sequence): as dated entries arrive in order, every new key sorts to the end of the table:
RowID -> Value: 20140101=44, 20140102=22, 20140103=23, 20140104=25, 20140105=31, 20140106=27, 20140107=25, 20140108=17
Always write to one server
No write parallelism
Temporal Keys: get(20140101 to 201404) returns 20140101=44 through 20140108=17 as one contiguous range. Fetching ranges uses few scans.
Keys to Scaling Writes on all servers Few Scans Temporal Keys
Binned Temporal Keys (build sequence): prefix each date key with a bin ID so successive days round-robin across three bins:
RowID -> Value: 0_20140101=44, 0_20140104=25, 0_20140107=25, 1_20140102=22, 1_20140105=31, 1_20140108=17, 2_20140103=23, 2_20140106=27
Uniform writes
Binned Temporal Keys: get(20140101 to 201404) becomes one range scan per bin (0_, 1_, 2_ prefixes). One scan per bin.
Keys to Scaling Writes on all servers Few Scans Binned Temporal Keys
Keys to Scaling Key design is critical Group data under common row IDs to reduce scans Prepend bins to row IDs to increase write parallelism
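The binning idea can be sketched as follows. The bin assignment (date modulo bin count) is an assumption that happens to reproduce the round-robin layout on the slides; any deterministic, uniform assignment works:

```python
def binned_row_id(date_str: str, num_bins: int = 3) -> str:
    # Prefix the date with a bin ID so consecutive days land in
    # different key ranges, and thus on different servers.
    return f"{int(date_str) % num_bins}_{date_str}"

def bin_ranges(start: str, end: str, num_bins: int = 3):
    # A single logical date range becomes one scan range per bin.
    return [(f"{b}_{start}", f"{b}_{end}") for b in range(num_bins)]
```

Write parallelism scales with the number of bins, while reads pay a fixed cost of one scan per bin.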
Splits: pre-split, or let the table split organically. Going from dev to production, ingest a representative sample, obtain its split points, and use them to pre-split the larger system. Hundreds or thousands of tablets per server are OK. Want at least one tablet per server.
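Deriving pre-split points from a sample can be sketched like this. The helper is illustrative only; Accumulo's shell and client API have their own mechanisms for getting and setting table splits:

```python
def split_points(sample_keys, num_servers, tablets_per_server=1):
    # Pick evenly spaced keys from a sorted sample to use as
    # pre-split points for the production table.
    keys = sorted(sample_keys)
    n_splits = num_servers * tablets_per_server - 1
    step = len(keys) / (n_splits + 1)
    return [keys[int(step * i)] for i in range(1, n_splits + 1)]

# 100 sampled keys, 4 servers -> 3 split points at the quartiles
points = split_points([f"{i:03d}" for i in range(100)], num_servers=4)
```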
Effect of Compression: similar sorted keys compress well, so you may need more data than you think to trigger auto-splits.
Inserts are fast 10s of thousands per second per machine
Updates *can* be
Update Types Overwrite Combine Complex
Update - Overwrite: performance is the same as insert. Ignore (don't read) the existing value; Accumulo's VersioningIterator does the overwrite.
Update - Overwrite: userb:age -> 34 (before)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       32
af362de4   account   $25
c48e2ade   name      Joe
c48e2ade   age       59
e2e4dac4   name      Bob
e2e4dac4   age       43
e2e4dac4   account   $30
Update - Overwrite: userb:age -> 34 (after)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       34
af362de4   account   $25
c48e2ade   name      Joe
c48e2ade   age       59
e2e4dac4   name      Bob
e2e4dac4   age       43
e2e4dac4   account   $30
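What the VersioningIterator effectively does here (with max versions = 1) can be emulated in a few lines. This is a toy model in Python, not Accumulo code:

```python
def latest_versions(entries):
    # entries: (row, col, timestamp, value) tuples.
    # Keep only the newest value per (row, col), so an overwrite
    # is just a new insert with a later timestamp.
    latest = {}
    for row, col, ts, value in entries:
        if (row, col) not in latest or ts > latest[(row, col)][0]:
            latest[(row, col)] = (ts, value)
    return {key: value for key, (ts, value) in latest.items()}

latest_versions([("af362de4", "age", 1, "32"),
                 ("af362de4", "age", 2, "34")])
# → {("af362de4", "age"): "34"}
```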
Update - Combine: things like X = X + 1. Normally one would have to read the old value to do this, but Accumulo Iterators allow multiple inserts to be combined at scan time or at compaction. Performance is the same as inserts.
Update - Combine: userb:account -> +10 (before)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       34
af362de4   account   $25
(other rows unchanged)
Update - Combine: the +10 is written as a new insert alongside the old value
RowID      Col       Value
af362de4   account   $25
af362de4   account   $10
(other rows unchanged)
Update - Combine: getaccount(userb) combines the two entries at scan time: $25 + $10 = $35
Update - Combine: after compaction a single entry remains: af362de4 account $35
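The combine-at-scan-time behavior (as a summing combiner would provide) can be modeled like this. Again a toy model, not Accumulo code:

```python
from collections import defaultdict

def combined_scan(entries):
    # Multiple inserts for the same (row, col) are summed when read,
    # so an "update" is just another insert; no read is needed at
    # write time.
    totals = defaultdict(int)
    for row, col, value in entries:
        totals[(row, col)] += value
    return dict(totals)

combined_scan([("af362de4", "account", 25),
               ("af362de4", "account", 10)])
# → {("af362de4", "account"): 35}
```

Compaction applies the same function permanently, collapsing the partial entries into one.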
Update - Complex: some updates require looking at more data than Iterators have access to, such as multiple rows. These require reading the data out in order to write the new value. Performance will be much slower.
Update - Complex: userc:account = getbalance(usera) + getbalance(userb) = $35 + $30 = $65 (before)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       34
af362de4   account   $35
c48e2ade   name      Joe
c48e2ade   age       59
c48e2ade   account   $40
e2e4dac4   name      Bob
e2e4dac4   age       43
e2e4dac4   account   $30
Update - Complex: userc:account = getbalance(usera) + getbalance(userb) = $35 + $30 = $65 (after)
RowID      Col       Value
af362de4   name      Annie
af362de4   age       34
af362de4   account   $35
c48e2ade   name      Joe
c48e2ade   age       59
c48e2ade   account   $65
e2e4dac4   name      Bob
e2e4dac4   age       43
e2e4dac4   account   $30
Planning a Larger-Scale Cluster: 10^2 to 10^4
Storage vs Ingest (log-log chart): ingest rate in millions of entries per second vs. storage in terabytes, for 1x1 TB and 12x3 TB disk configurations
Model for Ingest Rates
N - number of machines
S - single-server throughput (entries/second)
A - aggregate cluster throughput (entries/second)
A = 0.85^(log2 N) * N * S
Each doubling of the cluster retains 85% efficiency, multiplying the write rate by 2 x 0.85 = 1.7
Estimating Machines Required
N - number of machines
S - single-server throughput (entries/second)
A - target aggregate throughput (entries/second)
N = 2^(log2(A/S) / 0.7655347), where 0.7655347 = log2(1.7)
Each doubling of the cluster retains 85% efficiency, multiplying the write rate by 1.7
Predicted Cluster Sizes (chart): number of machines required (0 to 12,000) vs. target throughput in millions of entries per second (0 to 600)
100 Machines: 10^2
Multiple racks
10 TB RAM
100 TB - 1PB Disk
Some hardware failures in the first week (burn in)
Expect 3 failed HDs in first 3 mo
Another 4 within the first year http://static.googleusercontent.com/media/research.google.com/en/us/archive/disk_failures.pdf
Can process the 1000 Genomes data set 260 TB www.1000genomes.org
Can store and index the Common Crawl Corpus: 2.8 billion web pages, 541 TB commoncrawl.org
One year of Twitter: 182 billion tweets, 483 TB http://www.sec.gov/archives/edgar/data/1418091/000119312513390321/d564001ds1.htm
Deploying an Application Users Clients Tablet Servers
May not see the effect of writing to disk for a while
1000 machines: 10^3
Multiple rows of racks
100 TB RAM
1-10 PB Disk
Hardware failure is a regular occurrence
Hard drive failure about every 5 days on average, skewed towards the beginning of the year
Can traverse the brain graph 70 trillion edges, 1 PB http://www.pdl.cmu.edu/sdi/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Facebook Graph 1s of PB http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_dhrubaborthakur.pdf
Netflix Video Master Copies 3.14 PB http://www.businessweek.com/articles/2013-05-09/netflix-reedhastings-survive-missteps-to-join-silicon-valleys-elite
World of Warcraft Backend Storage 1.3 PB http://www.datacenterknowledge.com/archives/2009/11/25/wows-back-end-10-data-centers-75000-cores/
Webpages, live on the Internet 14.3 Trillion http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html
Things like the difference between two compression algorithms start to make a big difference
Use range compactions to effect changes on portions of a table
Lay off ZooKeeper
Watch Garbage Collector and Namenode ops
Garbage Collection > 5 minutes?
Start thinking about NameNode Federation
Accumulo 1.6
Multiple NameNodes: Accumulo running over multiple HDFS clusters, each with its own NameNode and DataNodes
Multiple NameNodes: multiple NameNodes sharing one set of DataNodes (HDFS Federation; requires Hadoop 2.0)
More NameNodes = higher risk of one going down. Can use HA NameNodes in conjunction with Federation.
10,000 machines: 10^4
You, my friend, are here to kick a** and chew bubble gum
1 PB RAM
10-100 PB Disk
1 hardware failure every hour on average
Entire Internet Archive 15 PB http://www.motherjones.com/media/2014/05/internet-archive-wayback-machine-brewster-kahle
A year s worth of data from the Large Hadron Collider 15 PB http://home.web.cern.ch/about/computing
0.1% of all Internet traffic in 2013 43.6 PB http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html
Facebook Messaging Data 10s of PB http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_dhrubaborthakur.pdf
Facebook Photos 240 billion, high 10s of PB http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_dhrubaborthakur.pdf
Must use multiple NameNodes
Can tune back heartbeats, periodicity of central processes in general
Can combine multiple PB data sets
Up to 10 quadrillion entries in a single table
While maintaining sub-second lookup times
Only with Accumulo 1.6
Dealing with data over time
Data Over Time - Patterns: Initial Load, Increasing Velocity, Focus on Recency, Historical Summaries
Initial Load Get a pile of old data into Accumulo fast Latency not important (data is old) Throughput critical
Bulk Load RFiles
Bulk Loading: MapReduce writes RFiles, which are imported into Accumulo
Increasing velocity
If your data isn't big today, wait a little while
Accumulo scales up dynamically, online. No downtime
The first sense of "scale": changing the size
Scaling Up (Clients / Accumulo / HDFS):
1. Start with 3 physical servers, each running a Tablet Server process and a DataNode process.
2. Start 3 new Tablet Server processes and 3 new DataNode processes.
3. The Master immediately assigns tablets to the new servers.
4. Clients immediately begin querying the new Tablet Servers.
5. New Tablet Servers read existing data from the old DataNodes.
6. New Tablet Servers write new data to the new DataNodes.
Never really seen anyone do this
Except myself
20 machines in Amazon EC2
to 400 machines
all during the same MapReduce job reading data out of Accumulo, summarizing, and writing back
Scaled back down to 20 machines when done
Just killed Tablet Servers
Decommissioned Data Nodes for safe data consolidation to remaining 20 nodes
Other ways to go from 10^x to 10^(x+1)
Accumulo Table Export
followed by HDFS DistCP to new cluster
Maybe new replication feature
Newer Data is Read more Often
Accumulo keeps newly written data in memory
Block Cache can keep recently queried data in memory
Combining Iterators make maintaining summaries of large amounts of raw events easy
Reduces storage burden
Historical Summaries (chart): unique entities stored vs. raw events processed, April through July
Age-off iterator can automatically remove data over a certain age
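The age-off behavior can be sketched as follows. This is a toy model in Python, not the actual iterator, and the tuple layout is an assumption for illustration:

```python
import time

def age_off(entries, max_age_seconds, now=None):
    # entries: (key, timestamp, value) tuples. Drop anything older
    # than max_age_seconds, as an age-off filter would do at scan
    # and compaction time.
    now = time.time() if now is None else now
    return [(k, ts, v) for (k, ts, v) in entries if now - ts <= max_age_seconds]

age_off([("old", 0, "x"), ("new", 95, "y")], max_age_seconds=10, now=100)
# → [("new", 95, "y")]
```

Because expired entries are simply filtered rather than rewritten, removal costs nothing at write time.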
IBM estimates 2.5 exabytes of data is created every day http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
90% of available data created in last 2 years http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
25 new 10k node Accumulo clusters per day
Accumulo is doing its part to get in front of the big data trend
Questions?
@aaroncordova