Non-Stop for Apache HBase: Multi-active region server clusters
TECHNICAL BRIEF
HBase is a non-relational database that provides linear scalability by dividing tables into regions and hosting regions on any number of region servers. The relationship between region servers and regions is 1 : n. However, the use of a single active region server for any particular region causes resilience and performance problems. In this paper we present Non-Stop for Apache HBase, a product that uses an active-active consensus algorithm to provide multiple active region servers (possibly in different data centers) per region, along with multiple active HBase masters (HMasters). This design alleviates problems caused by region server failure and offers important performance improvements based on load balancing and local data access.

Design Goals

In order to improve HBase resilience and scalability we want to change the cardinality of region servers to regions from 1 : n to x : n while maintaining full consistency of data. Specifically:

- Data is fully available for reads and writes from any participating region server at any site.
- Changes to region data are coordinated and consistent, providing single-copy consistency to all clients.
- Any region server can be lost with no interruption of service.
- An entire data center can be lost with no interruption of service (provided a quorum of region servers remains).
- Schema changes and region server administration are coordinated and consistent between the multiple active HMasters.
- Any HMaster can be lost without interrupting service.

Architecture

The high-level architecture is shown below in Figure 1. For any particular region, Non-Stop for Apache HBase provides a single logical region server composed of a quorum of region server nodes.
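The single-copy consistency goal can be illustrated with a toy sequencer: if every region server applies the same globally agreed writes in the same order, all replicas converge to identical state no matter where each write originated. The classes below are a hypothetical sketch of that invariant only, not the product's implementation.

```python
# Sketch: single-copy consistency via agreed global ordering.
# A toy "coordination engine" assigns each proposed write a global sequence
# number (GSN); every replica applies agreed writes in GSN order, so all
# memstores converge regardless of which data center originated the write.

class ToyCoordinationEngine:
    def __init__(self):
        self.next_gsn = 0
        self.log = []  # agreed (gsn, key, value) transactions, in GSN order

    def propose(self, key, value):
        gsn = self.next_gsn
        self.next_gsn += 1
        self.log.append((gsn, key, value))
        return gsn

class ToyRegionServer:
    def __init__(self):
        self.memstore = {}
        self.applied_gsn = -1

    def apply_agreed(self, log):
        # Apply only transactions not yet applied, strictly in GSN order.
        for gsn, key, value in log:
            if gsn > self.applied_gsn:
                self.memstore[key] = value
                self.applied_gsn = gsn

engine = ToyCoordinationEngine()
dc1, dc2 = ToyRegionServer(), ToyRegionServer()

# Writes arrive at different data centers but receive one global order.
engine.propose("row1", "a")   # e.g. a client near data center 1
engine.propose("row2", "b")   # e.g. a client near data center 2
engine.propose("row1", "c")   # a later overwrite of row1

dc1.apply_agreed(engine.log)
dc2.apply_agreed(engine.log)
assert dc1.memstore == dc2.memstore == {"row1": "c", "row2": "b"}
```

Because both replicas apply the identical agreed log, the final overwrite of row1 wins on every node, which is exactly the single-copy guarantee stated above.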
Figure 1: Non-Stop for Apache HBase architecture showing multiple active region servers per region across two data centers, connected over a WAN. (There is no theoretical limit on the number of data centers that can participate in the quorum.)

The multiple active HMasters serve two purposes. First, there is no interruption of service for schema changes and region server operations like splitting, even if an HMaster or an entire cluster is lost. Second, Non-Stop for Apache HBase uses the HMaster in the client write path, so using multiple active HMasters guarantees continuity of operations.

Write Path

The traditional write path in HBase is shown below in Figure 2: the client locates the appropriate region server via ZooKeeper and the -META- region, issues the Put/Delete, and the region server writes to the write-ahead log (WAL) and the memstore before eventually flushing to an HFile on disk.

Figure 2: HBase write path

Non-Stop for Apache HBase changes this write path in two important ways. First, writes are coordinated using WANdisco's DConE replication engine. (For more details on DConE and its previous applications to Hadoop, see this paper on WANdisco's Distributed Coordination Engine.) DConE guarantees that writes are ordered and applied consistently no matter the source of the activity, and is resilient in the face of WAN latency and common failure modes. Note that the write-ahead log (WAL) is replaced by DConE's database, and all region servers update their memstore and local
Technical Brief: -active region server clusters DConE database as part of processing agreed transactions. Client HBase Master -META- Server DConE DConE database memstore HFile Find Server Find write region Put / Delete Agreement consensus on all nodes Coordinate write Agreement Server Write to app database Agreement handling on all nodes Write to memstore Flush to disk Figure 3: Non-Stop for Apache HBase write path featuring coordinated writes and simplified region lookup Second, Non-Stop for Apache HBase provides a modified HBase Master that replaces ZooKeeper for region server lookup. The modified HBase Master understands that any region can be hosted on a number of active region servers. As a side effect, eliminating ZooKeeper reduces a brittle part of the HBase architecture. Other Coordinated Activities In an active-active deployment other region activities must be coordinated as well. Figure 3 showed that flushing the memstore to disk can now be local or coordinated. More specifically, a local flush can occur at any time to relieve memory pressure on a region server. But only one region server in a quorum can write the HFile to persistent HDFS storage, so a local flush only writes to local storage. When it is time to flush to persistent storage, a coordinated flush has all region servers write an HFile at the same global sequence (GSN) to local storage. One of the region servers in the quorum then compacts to an HFile on HDFS. Page 4 of 6
Figure 4: Coordinated and local flushes in Non-Stop for Apache HBase. Any node can write a local HFile at any time to relieve memory pressure; in a coordinated flush each node writes a local HFile at the same GSN, and one node then compacts to HDFS, after which the local HFiles are dropped.

Other coordinated activities include region server splits and merges.

Applications

Non-Stop for Apache HBase offers significant improvements in HBase resilience and scalability.

Resilience

With several active region servers in the quorum, Non-Stop for Apache HBase can tolerate the loss of one region server or an entire data center with no interruption to read or write activity. To be precise, a quorum of 2F+1 nodes can tolerate F failures. A typical recovery point objective is minutes or less, and a typical recovery time objective is zero.

Non-Stop for Apache HBase uses multiple state machines, and it is even possible to have different quorums for different regions in the same table. That would allow writes to continue into different regions from different data centers in the event of a network partition, with automatic recovery when the network is restored.
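The 2F+1 quorum arithmetic above is easy to state directly; a minimal sketch:

```python
# Sketch: majority-quorum fault tolerance. A quorum of 2F+1 nodes can
# tolerate F failures, because a majority of F+1 nodes still remains
# to agree on writes.

def failures_tolerated(quorum_size):
    """Number of node failures a majority quorum of this size survives."""
    return (quorum_size - 1) // 2

def quorum_size_for(failures):
    """Minimum quorum size needed to tolerate the given failure count."""
    return 2 * failures + 1

assert failures_tolerated(3) == 1   # 3 nodes survive 1 failure
assert failures_tolerated(5) == 2   # 5 nodes survive 2 failures
assert quorum_size_for(1) == 3
assert quorum_size_for(2) == 5
```

This is why losing one region server, or one data center out of three quorum sites, leaves the remaining majority free to continue serving reads and writes.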
Figure 5: Multiple state machines allow writes to occur at each location even during a network partition (each data center continues to accept writes for the regions whose quorum it retains, while the other regions report no quorum).

Furthermore, an HMaster failure does not cause interruption of schema changes or region server administration (e.g., region splits).

Scalability

In the traditional HBase model, a single region server can become a performance bottleneck due to problems like region hotspots. Non-Stop for Apache HBase alleviates that problem by automatically load balancing client activity among several active region servers. Additionally, HBase clients in different geographic locations all gain the benefit of fast local read-write access.

Alternatives

HBase provides separate active-passive solutions for region server failover, limited load balancing, and disaster recovery. The introduction of timeline-consistent region server replicas eliminates downtime for reads, but write activity is blocked until the region migrates to a new server. That process takes from 1 to 15 minutes depending on configuration.

Native HBase replication is used on a per-table basis for balancing of read operations and disaster recovery. However, it must be configured for each table and does not support write activity on target nodes. (It operates in a master-slave or multi-master fashion, with no guarantee of consistency if writes originate in multiple places.)

US Toll Free: 1-877-WANDISCO (926-3472)
Outside US: +1 925 380 1728
EU: +44 114 3039985
APAC: +61 2 8211 0620
Email: sales@wandisco.com