Apache HBase: Where We've Been and What's Upcoming

Transcription

1 Apache HBase: Where We've Been and What's Upcoming. Jonathan, Software Engineer at Cloudera, HBase PMC Member. BigData.be, April 4, 2014.

2 Who Am I? Cloudera: Tech Lead, HBase Team; Software Engineer. Apache HBase committer / PMC. Apache Flume founder / PMC. U of Washington: Research in Distributed Systems.

3 What is Apache HBase? Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access. [Diagram labels: App, ZK, MR, HDFS]

4 Where We've Been: An HBase History

5 Apache HBase Timeline. Nov 06: Google BigTable at OSDI 06; Summer 09: StumbleUpon goes production on HBase ~0.20; Summer 11: Messages on HBase; Summer 11: Web Crawl Cache; May 12: HBaseCon 2012; Nov 11: Cassini on HBase; Jun 13: HBaseCon 2013; Jan 13: Phoenix on HBase; Apr 07: First Apache HBase commit as Hadoop contrib project; Jan 08: Promoted to Hadoop subproject; Apr 10: Apache HBase becomes top level project; Apr 11: CDH3 GA with HBase; Jan 12, May 12, Oct 13, Feb 13: [labels missing]

6 Developer Community. Active community! Diverse committers from many organizations.

7 Apache HBase Nascar Slide

8 Apache HBase Core Development Vendors Self Service

9 Apache HBase Sample Users Inbox Storage Web Search Analytics Monitoring

10 Apache HBase Ecosystem Projects

11 Today: Apache Disaster recovery, Continuity, and MTTR

12 HBase provides Low-latency Random Access. Writes: 1-3ms, 1k-20k writes/sec per node. Reads: 0-3ms cached, 10-30ms disk; 10k-40k reads/second/node from cache. Cell size: 0B-3MB. Read, write, and insert data anywhere in the table.

13 Core Properties. ACID guarantees on a row. Writes are durable. Strong consistency first, then availability: after a failure, recover and return the current value instead of returning a stale value. CAS and atomic increments can be efficient. Sorted by Primary Key: short scans are efficient. Partitioned by Primary Key. Log Structured Merge Tree: writes are extremely efficient; reads are efficient; periodic layout optimizations for read optimization ("compactions") are required.
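The log-structured merge behaviour listed on this slide can be illustrated with a toy model. This is a minimal sketch, not HBase's implementation: the class `TinyLSM`, its `flush_threshold` parameter, and the method names are hypothetical and exist only for illustration.

```python
import bisect

class TinyLSM:
    """Toy log-structured merge store (hypothetical, for illustration only).
    Writes land in an in-memory buffer that is flushed to immutable sorted
    runs; reads consult the newest data first."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # hypothetical tuning knob
        self.memstore = {}                      # recent writes
        self.runs = []                          # sorted runs, newest last

    def put(self, key, value):
        self.memstore[key] = value              # a write never reads old data
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            self.runs.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for run in reversed(self.runs):         # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one so a read touches a single sorted run."""
        merged = {}
        for run in self.runs:                   # oldest first, newer overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []

    def scan(self, start, stop):
        """Short range scan over keys in [start, stop)."""
        view = {}
        for run in self.runs:
            view.update(run)
        view.update(self.memstore)
        return [(k, v) for k, v in sorted(view.items()) if start <= k < stop]
```

Writes only touch the in-memory buffer, which is why they are cheap; reads may have to look at several runs until a compaction merges them.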

14 Critical Features. Disaster Recovery: Cluster Replication; Table Snapshots; Copy Table; Import / Export Tables; Metadata Corruption repair tool (hbck). Administrative and Continuity: Kerberos based Authentication; ACL based Authorization; Config change via rolling restart; Within version rolling upgrade; Protobuf based wire protocol for RPC future proofing.

15 Hardened for Table Administration: Online Schema change; Online Region Merging; Continuous fault injection testing with Chaos Monkey. Performance Tuning: Alternate key encodings for efficient memory usage; Exploring Compactor policy minimizes compaction storms; Smart and Adaptive Stochastic region load balancer; Fast split policy for new tables.

16 Mean Time to Recovery (MTTR). Machine failures happen in distributed systems. MTTR is the average unavailability when automatically recovering from a failure. Recovery time for an unclean data center power cycle. [Diagram labels: detect, repair, notify, recovered; Region unavailable; Region available, client unaware; Region available, client aware]

17 Fast notification and detection (0.96). Proactive notification of HMaster failure (0.96). Proactive notification of RS failure (0.96). Notify client on recovery (0.96). Fast server failover (Hardware). [Diagram labels: detect, split, assign, replay, recovered; Region unavailable; Region available for RW; hdfs]

18 Distributed log replay (0.96). Previously had two IO-intensive passes: log splitting to intermediate files, then assign and log replay. Now just one IO-heavy pass: assign first, then split+replay. Improves read and write recovery times. Off by default currently*. *Caveat: If you override timestamps you could have READ REPEATED isolation violations (use tags to fix this). [Diagram labels: detect, assign, split + replay, recovered; Region unavailable; Region available for replay writes; Region available for RW; hdfs]
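The difference between the two recovery flows can be sketched with a toy write-ahead log. This is an assumption-laden illustration, not HBase code: the tuple layout of a WAL entry, the function names, and the server names are all hypothetical. Each function returns the recovered state plus the number of edits written to intermediate files.

```python
from collections import defaultdict

# Hypothetical WAL entry: (sequence_id, region, row, value).
# One server's WAL interleaves edits for many regions.
WAL = [(1, "r1", "a", "x"), (2, "r2", "k", "y"),
       (3, "r1", "a", "z"), (4, "r2", "m", "w")]

def split_then_replay(wal):
    """Older flow: pass 1 splits the log into per-region intermediate
    files, pass 2 replays those files after the regions are assigned."""
    intermediate = defaultdict(list)
    for entry in wal:                          # IO pass 1: log splitting
        intermediate[entry[1]].append(entry)
    regions = defaultdict(dict)
    for region, edits in intermediate.items():  # IO pass 2: replay
        for _, _, row, value in sorted(edits):
            regions[region][row] = value
    return dict(regions), len(wal)

def distributed_replay(wal, assignment):
    """Newer flow: regions are assigned first, then every edit is sent
    straight to the server now hosting its region (one split+replay pass)."""
    servers = defaultdict(lambda: defaultdict(dict))
    for _, region, row, value in sorted(wal):  # single IO-heavy pass
        servers[assignment[region]][region][row] = value
    regions = {r: rows for hosted in servers.values()
               for r, rows in hosted.items()}
    return regions, 0
```

Both flows end in the same region state; the sketch only shows that the second one skips the intermediate files.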

19 What's Upcoming: A Future HBase

20 Outline. Improved Mean time to recovery (MTTR). Improved Predictability. Improved Usability. Improved Multitenancy.

21 Improving MTTR Further: Faster read recovery

22 Distributed log replay (0.96*). Previously had two IO-intensive passes: log splitting to intermediate files, then assign and log replay. Now just one IO-heavy pass: assign first, then split+replay. Improves read and write recovery times. Off by default currently*. *Caveat: If you override timestamps you could have READ REPEATED isolation violations (use tags to fix this). [Diagram labels: detect, assign, split + replay, recovered; Region unavailable; Region available for replay writes; Region available for RW; hdfs]

23 Distributed log replay with fast write recovery. Writes in HBase do not incur reads. With distributed log replay, we already have regions open for write. Allow fresh writes while replaying old logs*. *Caveat: If you override timestamps you could have READ REPEATED isolation violations (use tags to fix this). [Diagram labels: detect, assign, split + replay, recovered; Region unavailable; Region available for all writes; Region available for RW; hdfs]

24 Fast Read Recovery (proposed). Idea: Pristine Region fast read recovery. If a region has not been edited, it is consistent and can recover RW immediately. Idea: Shadow Regions for fast read recovery. A shadow region tails the WAL of the primary region. The shadow memstore is one HDFS block behind; catch up, then recover RW. Currently some progress for trunk. [Diagram labels: detect, assign, recovered; Region unavailable; Can guarantee no new edits? Region available for all RW; Can guarantee we have all edits? Region available for all RW]

25 Improving Predictability: Improving the 99%tile

26 Common causes of performance variability. Locality Loss: Favored Nodes, HDFS block affinity. Compaction: Exploring compactor. GC*: Off-heap Cache. Hardware hiccups: Multi WAL, HDFS speculative read.

27 Performance degraded after recovery. After recovery, reads suffer a performance hit because regions have lost locality. To maintain performance after failover, we need to regain locality. Compact the region to regain locality. We can do better by using HDFS features. [Diagram labels: recovery; Service recovered, degraded performance; Performance recovered because compaction restores locality; performance recovered]

28 Read Throughput: Favored Nodes (0.96*). Control and track where block replicas are. All files for a region are created such that blocks go to the same set of favored nodes. When failing over, assign the region to one of those favored nodes. Currently a preview feature in 0.96. Disabled by default because it doesn't work well with the latest balancer or splits. Will likely use upcoming HDFS block affinity for better operability. Originally on Facebook's 0.89, ported to 0.96. [Diagram labels: recovery; Service recovered, performance sustained because region assigned to favored node; performance recovered]
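The failover rule on this slide can be sketched as a small selection function. This is a hedged illustration only: `pick_failover_server`, its parameters, and the load-based tie-break are assumptions, not the HBase balancer's logic.

```python
def pick_failover_server(favored_nodes, live_servers, load):
    """Prefer a live favored node, where the region's block replicas are
    local; otherwise fall back to the least-loaded live server.
    All names and the load heuristic are hypothetical."""
    candidates = [s for s in favored_nodes if s in live_servers]
    pool = candidates or sorted(live_servers)
    return min(pool, key=lambda s: load.get(s, 0))
```

With favored nodes `["rs1", "rs2", "rs3"]` and `rs1` dead, the region goes to whichever of `rs2` and `rs3` is less loaded, so reads stay local without waiting for a compaction.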

29 Read latency: HDFS hedged read (CDH5.0). HBase's region servers use the HDFS client to read 1 of 3 HDFS block replicas. If you chose the slow node, your reads are slow. If a read is taking too long, speculatively go to another replica that may be faster. [Diagram labels: RS, HDFS replicas 1 2 3; Slow read! Too slow, read other replica]
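The hedging idea can be sketched with threads and a timeout. This is a minimal sketch of the general technique, not the HDFS client's code: `hedged_read`, `hedge_after`, and the simulated replica delays are assumptions made for illustration, and the replica list is assumed to be non-empty.

```python
import concurrent.futures as cf
import time

def hedged_read(replicas, read_fn, hedge_after=0.05):
    """Read from the first replica; if it has not answered within
    `hedge_after` seconds, also ask the next replica, and return
    whichever outstanding read finishes first."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        pending = set()
        for replica in replicas:
            pending.add(pool.submit(read_fn, replica))
            done, pending = cf.wait(pending, timeout=hedge_after,
                                    return_when=cf.FIRST_COMPLETED)
            if done:
                return next(iter(done)).result()
        done, _ = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)   # do not block on the slow replica

# Simulated replicas (hypothetical names and delays, in seconds).
SIMULATED_DELAY = {"dn1": 0.5, "dn2": 0.01, "dn3": 0.01}

def simulated_read(replica):
    time.sleep(SIMULATED_DELAY[replica])
    return f"block-from-{replica}"
```

Calling `hedged_read(["dn1", "dn2", "dn3"], simulated_read)` gives up waiting on the slow `dn1` after the hedge threshold and returns the answer from `dn2`. The cost is extra read load whenever the hedge fires.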

30 Read latency: Read Replicas (in progress). The HBase client reads from primary region servers. If you chose the slow node, your reads are slow. Idea: read replicas assigned to other region servers. Replicas periodically catch up (via snapshots or shadow region memstores). The client specifies if a stale read is OK. If a read is taking too long, speculatively go to another replica that may be faster. [Diagram labels: HBase Client, Region replicas; Slow read! Too slow, read stale replica]

31 Write latency: Multiple WALs (in progress). HBase's HDFS client writes 3 replicas. Min write latency is bounded by the slowest of the 3 replicas. Idea: if a write is taking too long, duplicate it on another set of replicas that may be faster. [Diagram labels: RS, HDFS replicas; Slow write; Too slow, write to other replica]

32 MR over Table Snapshots (0.98, CDH5.0). Previously, MapReduce jobs over HBase required an online full table scan. Idea: take a snapshot and run the MR job over the snapshot files. Doesn't use the HBase client. Avoids affecting HBase caches. 3-5x perf boost. [Diagram labels: snapshot, map, reduce]

33 Improving Usability Autotuning, Tracing, and SQL

34 Making HBase easier to use and tune. Difficult to see what is happening in HBase. Easy to make poor design decisions early without realizing. New developments: memory auto-tuning; HTrace + Zipkin; frameworks for schema design.

35 Memory Use Auto-tuning. Memory is divided between the memstore (used for serving recent writes) and the block cache (used for read hot spots). Need to choose the balance for the workload. [Diagram labels: Read heavy, Balanced, Write heavy; memstore, Block cache]
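One way to picture auto-tuning is a feedback step that shifts a small slice of heap toward whichever side is under pressure. The sketch below is an assumption, not HBase's tuner: the function `retune`, the pressure signals (`blocked_flushes`, `cache_evictions`), and the `step`, `lo`, and `hi` bounds are all hypothetical.

```python
def retune(memstore_frac, cache_frac, blocked_flushes, cache_evictions,
           step=0.02, lo=0.1, hi=0.6):
    """Move up to `step` of the heap between memstore and block cache.
    The sum of the two fractions is preserved, and neither side is
    pushed outside [lo, hi]. All parameters are hypothetical."""
    if blocked_flushes > cache_evictions:      # write heavy: grow memstore
        delta = max(0.0, min(step, hi - memstore_frac, cache_frac - lo))
    elif cache_evictions > blocked_flushes:    # read heavy: grow block cache
        delta = -max(0.0, min(step, hi - cache_frac, memstore_frac - lo))
    else:                                      # balanced: leave as is
        delta = 0.0
    return round(memstore_frac + delta, 4), round(cache_frac - delta, 4)
```

Run periodically, a step like this drifts a balanced 40/40 split toward the memstore under write pressure and toward the block cache under read pressure.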

36 HTrace. Problem: Where is time being spent inside HBase? Solution: HTrace Framework. Inspired by Google Dapper. Threaded through HBase and HDFS. Tracks time spent in calls in a distributed system by tracking spans* on different machines. *Some assembly still required.

37 HTrace: Distributed Tracing in HBase and HDFS. Framework inspired by Google Dapper. Tracks time spent in calls in RPCs across different machines. Threaded through HBase (0.96) and future HDFS. [Diagram labels: HBase Client, HDFS, HBase, DN, NN, RS, meta, ZK; RPC calls; A span]
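The notion of a span can be sketched in a few lines. This is not the HTrace API: the `Tracer` class and the span names are hypothetical, and the sketch runs in one process, whereas a Dapper-style tracer also carries the trace and parent ids across machines with each RPC.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Collects spans: named, timed intervals linked to a parent span.
    Hypothetical single-process illustration."""

    def __init__(self):
        self.spans = []       # finished spans, in completion order
        self._stack = []      # ids of currently open spans
        self._next_id = 1

    @contextmanager
    def span(self, name):
        span_id = self._next_id
        self._next_id += 1
        parent = self._stack[-1] if self._stack else None
        self._stack.append(span_id)
        start = time.perf_counter()
        try:
            yield span_id
        finally:
            self._stack.pop()
            self.spans.append({"id": span_id, "parent": parent, "name": name,
                               "seconds": time.perf_counter() - start})

# Usage: nested calls produce a tree of spans.
tracer = Tracer()
with tracer.span("client.get"):
    with tracer.span("regionserver.get"):
        with tracer.span("hdfs.read"):
            time.sleep(0.01)
```

Following the `parent` links from `hdfs.read` back to `client.get` shows where the time of one request was spent.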

38 Zipkin: Visualizing Spans. UI + Visualization System. Written by Twitter. Zipkin HBase Storage. Zipkin HTrace integration. View where time from a specific call is spent in HBase, HDFS, and ZK.

39 HBase Schemas. HBase application developers must iterate to find a suitable HBase schema. Schema is critical for performance at scale. How can we make this easier? How can we reduce the expertise required to do this? Today: lots of tuning knobs. Developers need to understand column families, rowkey design, and data encoding. Some are expensive to change after the fact.

40 How should I arrange my data? Isomorphic data representations!
Short Fat Table using column qualifiers:
Rowkey | d:col1 | d:col2 | d:col3 | d:col4
bob | aaaa | bbbb | cccc | dddd
jon | eeee | ffff | gggg | hhhhh
Short Fat Table using column families:
Rowkey | col1: | col2: | col3: | col4:
bob | aaaa | bbbb | cccc | dddd
jon | eeee | ffff | gggg | hhhhh
Tall skinny with compound rowkey:
rowkey | d:
bob-col1 | aaaa
bob-col2 | bbbb
bob-col3 | cccc
bob-col4 | dddd
jon-col1 | eeee
jon-col2 | ffff
jon-col3 | gggg
jon-col4 | hhhh
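That the three layouts carry the same information can be checked with plain dictionaries. This is a sketch with hypothetical helper names; it models a table as `{rowkey: {column: value}}` and assumes column names contain no `-`, so the compound rowkey can be split again.

```python
# Logical data: entity -> attribute -> value.
LOGICAL = {
    "bob": {"col1": "aaaa", "col2": "bbbb", "col3": "cccc", "col4": "dddd"},
    "jon": {"col1": "eeee", "col2": "ffff", "col3": "gggg", "col4": "hhhh"},
}

def wide_by_qualifier(rows, family="d"):
    """Short fat table: one family, one qualifier per attribute."""
    return {rk: {f"{family}:{col}": v for col, v in cols.items()}
            for rk, cols in rows.items()}

def wide_by_family(rows):
    """Short fat table: one column family per attribute, empty qualifier."""
    return {rk: {f"{col}:": v for col, v in cols.items()}
            for rk, cols in rows.items()}

def tall_skinny(rows, family="d"):
    """Tall skinny table: attribute folded into a compound rowkey."""
    return {f"{rk}-{col}": {f"{family}:": v}
            for rk, cols in rows.items() for col, v in cols.items()}

def from_tall_skinny(table):
    """Invert tall_skinny(), showing no information was lost."""
    rows = {}
    for key, cells in table.items():
        rk, col = key.rsplit("-", 1)
        rows.setdefault(rk, {})[col] = next(iter(cells.values()))
    return rows
```

The data is identical in all three; what changes is which reads are a single-row get, which are a short scan, and how many column families the region server has to manage.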

41 How should I arrange my data? Isomorphic data representations! (Same three layouts as slide 40: short fat table using column qualifiers, short fat table using column families, tall skinny with compound rowkey.) With great power comes great responsibility! How can we make this easier for users?

42 Impala. Scalable low-latency SQL querying for HDFS (and HBase!). ODBC/JDBC driver interface. Highlights: Uses the Hive metastore and its Hive-HBase connector configuration conventions. Native code implementation, uses JIT for query execution optimization. Authorization via Kerberos support. Open sourced by Cloudera: https://github.com/cloudera/impala

43 Phoenix. A SQL skin over HBase targeting low-latency queries. JDBC SQL interface. Highlights: Adds types. Handles compound row key encoding. Secondary indices in development. Provides some pushdown aggregations (coprocessor). Open sourced by Salesforce.com. Work from James Taylor, Jesse Yates, et al. https://github.com/forcedotcom/phoenix
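Compound row key encoding matters because HBase sorts rows by raw bytes, so the encoded key must sort the same way as the typed values. The sketch below shows one common order-preserving scheme and is not claimed to be Phoenix's actual format: `encode_key`, `decode_key`, and the (string, 64-bit integer) key shape are assumptions.

```python
import struct

SIGN_FLIP = 1 << 63  # maps signed 64-bit ints onto unsigned, keeping order

def encode_key(user, ts):
    """Encode (user, ts) so that byte order equals tuple order.
    The string is NUL-terminated; the integer is big-endian with its
    sign bit flipped. Hypothetical format for illustration."""
    if "\x00" in user:
        raise ValueError("user must not contain NUL")
    return user.encode("utf-8") + b"\x00" + struct.pack(">Q", ts + SIGN_FLIP)

def decode_key(key):
    """Inverse of encode_key(): the last 8 bytes are the integer."""
    raw_user, raw_ts = key[:-9], key[-8:]
    return raw_user.decode("utf-8"), struct.unpack(">Q", raw_ts)[0] - SIGN_FLIP
```

Because the terminator byte sorts before any other byte, `"bo"` sorts before `"bob"`, and flipping the sign bit makes negative integers sort before positive ones.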

44 Kite (nee Cloudera Development Kit/CDK). APIs that provide a Dataset abstraction. Provides a get/put/delete API in Avro objects. HBase support in progress. Highlights: Supports multiple components of the Hadoop distros (flume, morphlines, hive, crunch, hcat). Provides types using Avro and Parquet formats for encoding entities. Manages schema evolution. Open sourced by Cloudera: https://github.com/kite-sdk/kite

45 Multi-tenancy: Many apps and users in a single cluster

46 Growing HBase. Pre : scaling up HBase for single HBase applications. Essentially a single user for a single app. Ex: Facebook messages: one application, many HBase clusters. Shard users to different pods. Focused on continuity and disaster recovery features: Cross-cluster Replication, Table Snapshots, Rolling Upgrades. [Chart labels: # of clusters; # of isolated applications; One giant application, multiple clusters; Scalability]

47 Growing HBase. In 0.96 we introduce primitives for supporting multitenancy. Many users, many applications, one HBase cluster. Need to have some control over the interactions that different users cause. Ex: manage MR analytics and low-latency serving in one cluster. [Chart labels: # of clusters; # of isolated applications; One giant application, multiple clusters; Scalability; Many applications in one shared cluster; multitenancy]

48 Namespaces (0.96). Namespaces provide an abstraction for multiple tenants to create and manage their own tables within a large HBase instance. [Diagram labels: Namespace blue, Namespace green, Namespace orange]

49 Multitenancy goals. Security (0.96): separate admin ACLs for different sets of tables. Quotas (in progress): max tables, max regions. Performance Isolation (in progress): limit the performance impact that load on one table has on others. Priority (future): handle some tables before others.
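A per-namespace quota check of the kind listed under "Quotas" could look roughly like this. It is a sketch under assumptions, not the HBase quota feature (which the slide marks as in progress): `create_table`, `QuotaExceeded`, and the in-memory `catalog` are hypothetical. The `namespace:table` naming and the `default` namespace for unqualified names follow the namespace convention introduced in 0.96.

```python
class QuotaExceeded(Exception):
    """Raised when a namespace would exceed its table or region limit."""

def split_table_name(full_name):
    """'blue:orders' -> ('blue', 'orders'); unqualified names use 'default'."""
    namespace, sep, table = full_name.partition(":")
    return (namespace, table) if sep else ("default", full_name)

def create_table(catalog, quotas, full_name, regions=1):
    """catalog: {namespace: {table: region_count}}
    quotas:  {namespace: (max_tables, max_regions)}; missing = unlimited."""
    namespace, table = split_table_name(full_name)
    tables = catalog.setdefault(namespace, {})
    max_tables, max_regions = quotas.get(
        namespace, (float("inf"), float("inf")))
    if table in tables:
        raise ValueError(f"{full_name} already exists")
    if len(tables) + 1 > max_tables:
        raise QuotaExceeded(f"{namespace}: max tables is {max_tables}")
    if sum(tables.values()) + regions > max_regions:
        raise QuotaExceeded(f"{namespace}: max regions is {max_regions}")
    tables[table] = regions
```

With `quotas = {"blue": (2, 10)}`, a third table in namespace `blue` is rejected while other namespaces are unaffected.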

50 Isolation with Region Server Groups. Region assignment distribution (no region server groups). [Diagram labels: Namespace blue, Namespace green, Namespace orange]

51 Isolation with Region Server Groups. Region assignment distribution with Region Server Groups (RSG). [Diagram labels: Namespace blue, Namespace green, Namespace orange; RSG blue, RSG green orange]
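The effect of region server groups on assignment can be sketched as follows. This is an illustration under assumptions, not the HBase assignment manager: `assign_regions`, the group names, and the round-robin placement are hypothetical.

```python
from itertools import cycle

def assign_regions(regions, groups, namespace_group):
    """regions: list of (namespace, region_name) pairs
    groups: {group_name: [server, ...]}
    namespace_group: {namespace: group_name}
    Each region is placed round-robin, but only on servers that belong
    to its namespace's group."""
    cursors = {g: cycle(sorted(servers)) for g, servers in groups.items()}
    return {region: next(cursors[namespace_group[ns]])
            for ns, region in regions}

# Usage mirroring the slide: blue has its own group, green and orange share one.
groups = {"rsg_blue": ["rs1", "rs2"], "rsg_green_orange": ["rs3"]}
namespace_group = {"blue": "rsg_blue", "green": "rsg_green_orange",
                   "orange": "rsg_green_orange"}
regions = [("blue", "b1"), ("blue", "b2"), ("blue", "b3"),
           ("green", "g1"), ("orange", "o1")]
placement = assign_regions(regions, groups, namespace_group)
```

Load from the green and orange tenants never lands on the blue servers, which is the isolation the slide's diagram depicts.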

52 Cell Tags. Mechanism for attaching arbitrary metadata to Cells. Motivation: finer-grained isolation. Use for Accumulo-style cell-level visibility. Main feature for 0.98. Other uses: add sequence numbers to enable correct fast read/write recovery; potential for schema tags.
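Cell-level visibility built on tags can be sketched like this. It is a simplified assumption rather than the HBase or Accumulo implementation: the `Cell` class, the `"visibility"` tag type, and the rule that a reader needs every label on the cell are hypothetical (Accumulo-style labels are boolean expressions, which this sketch does not model).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    tags: tuple = ()   # (tag_type, tag_value) pairs; hypothetical layout

def visible_cells(cells, authorizations):
    """Return the cells whose visibility labels are all held by the reader.
    Cells without visibility tags are visible to everyone."""
    auths = set(authorizations)
    result = []
    for cell in cells:
        required = {v for t, v in cell.tags if t == "visibility"}
        if required <= auths:
            result.append(cell)
    return result

CELLS = [
    Cell("bob", "d:email", "bob@example.com", (("visibility", "pii"),)),
    Cell("bob", "d:city", "Brussels"),
    Cell("bob", "d:salary", "1000", (("visibility", "pii"), ("visibility", "hr"))),
]
```

Because the tag travels with the cell, two readers scanning the same row can see different columns without any change to the table schema.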

53 Conclusions

54 Summary by Version
Version | 0.90 (CDH3) | 0.92 / 0.94 (CDH4) | 0.96 (CDH5) | Next (0.98 / 1.0.0)
New Features | Stability | Reliability | Continuity | Multitenancy
MTTR | Recovery in hours | Recovery in minutes | Recovery of writes in seconds, reads in 10s of seconds | Recovery in seconds (reads+writes)
Perf | Baseline | Better throughput | Optimizing performance | Predictable performance
Usability | HBase developer expertise | HBase operational experience | Distributed systems admin experience | Application developers experience
