HBase Terminology Translation Legend:
- tablet == region
- Accumulo Master == HBase HMaster
- Metadata table == META
- TabletServer (or tserver) == RegionServer

Assignment is driven solely by the Master process. Assignment can be thought of as a state machine driven by the contents of the metadata table. The Master keeps some transient information in memory. ZooKeeper is used only for liveness checks on a TabletServer (ZooKeeper is checked by the Master, but also by TabletServers; details follow later). As such, the metadata table must always be in a consistent state: a state that the Master understands how to transition from, or (worst case) one for which a reasonable fix can be made. Consistency of the metadata table (and of the updates written to it before actions are taken) is very important for assignment to work as intended. Lost updates to the metadata table would all but guarantee multiple-assignment and data-loss bugs.

Some quick definitions for the Tablet states in this state machine:
- Unassigned: Not online and not scheduled to be assigned anywhere.
- Assigned: Not online, but scheduled to be assigned somewhere.
- Hosted: Assigned to a server, and that server brought the Tablet online (the desired state).
- Assigned to dead server: The metadata table records that a Tablet is hosted, but the Master has noticed that the TabletServer which should be hosting it is dead.

While the metadata table contains other information, for the purposes here, let's assume that it only contains information about tablets. Each row in the metadata table defines a tablet. For the purposes of assignment, each row contains columns for the following: current location, future location, and last location.

Last Location: The last location is used for preserving data locality. When assigning a tablet, the Master will observe the last location column and attempt to assign the tablet back to that location. Not much more to say.
The TabletServer updates this column after a compaction.

Future Location: The future location marks that the Master wants to assign a given Tablet to a given server. A tablet that is unassigned can first have its future location set, which will later trigger the Master to tell the TabletServer to bring the tablet online. This also helps with fault tolerance in the Master. For example, consider the Master failing during assignment: it calculates where a Tablet should be assigned but is restarted before completing the assignment. When the Master comes back, it can still assign the Tablet to that server. In the negative case, where the TabletServer is no longer alive, it's a simple state transition to unset the future location and let the process happen again.

Current Location: The current location stores where a tablet is currently assigned. This is updated by the TabletServer during the final phase of assignment, not by the Master. This is the last step before a Tablet is considered hosted. Another case when this column gets updated is when the Master notices that the server listed in this value for a Tablet is no longer alive; the Master will clear the current location as part of that transition.

General Assignment Loop

Let's outline the simplest case for assignment. Consider a single Tablet which is currently offline, say for a table that was just created. Its relevant assignment state would be as follows:

{current=null, future=null, last=null}

The Master scans the metadata table periodically, looking for Tablets which are not hosted. Because the above Tablet has no value for current, we know that it is not hosted. Because there is no value for last, we can choose any available TabletServer, as there is no locality to preserve. The Master will take the state of active TabletServers in the cluster (based on ZooKeeper), choose a TabletServer, and record that server's information in the value for future.

{current=null, future=server1:port, last=null}

After setting the future value, the Master will inform server1:port that it should assign the Tablet. This is a one-way Thrift call, a fire-and-forget message: the remote end of the RPC cannot send a response back to the client. This lets the Master tell a TabletServer to bring a Tablet online without requiring the Master to block on the RPC waiting for the TabletServer to actually perform the action. The TabletServer will (eventually) see the request from the Master to bring this Tablet online.
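The Master's half of this loop follows mechanically from the three location columns. A minimal sketch in Python, assuming hypothetical function names and a placeholder balancing policy (this is illustrative, not Accumulo's actual code):

```python
# Hypothetical sketch: deriving a Tablet's assignment state from its
# metadata row, and choosing a server for an unassigned Tablet.
# Function names and the balancing policy are illustrative only.

def tablet_state(row, live_servers):
    """Classify a Tablet from its current/future location columns."""
    if row["current"] is not None:
        if row["current"] in live_servers:
            return "HOSTED"
        return "ASSIGNED_TO_DEAD_SERVER"
    if row["future"] is not None:
        return "ASSIGNED"
    return "UNASSIGNED"

def choose_server(row, live_servers):
    """Prefer the last location to preserve locality, else any live server."""
    if row["last"] in live_servers:
        return row["last"]
    return sorted(live_servers)[0]  # placeholder for a real balancer

# A freshly created Tablet, as in the example above:
row = {"current": None, "future": None, "last": None}
live = {"server1:port", "server2:port"}
assert tablet_state(row, live) == "UNASSIGNED"

# The Master records its choice in the future column before sending
# the one-way "load this tablet" RPC to the chosen TabletServer.
row["future"] = choose_server(row, live)
assert tablet_state(row, live) == "ASSIGNED"
```

Note that the classification never needs Master-local state beyond the live-server set from ZooKeeper; everything else is read from the metadata row itself.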
After performing some precondition-like checks, the TabletServer will make the necessary updates in its own memory to host this Tablet and then write an update to the metadata table, setting the current column and unsetting the future column. Writes by the TabletServer are only allowed after a cached check of the ZooKeeper lock. This helps ensure that we don't have a zombie server trying to host tablets due to delayed RPCs from the Master, but it doesn't need to be a synced ZooKeeper read. In the worst-case scenario, where a TabletServer loses its lock, it tanks itself quickly and the tablets hosted there move into a state capable of being reassigned. These updates let the Master know that the Tablet has moved from the assigned state into the hosted state. Hooray.

{current=server1:port, future=null, last=null}

Later on, say a user wrote some data to this Tablet and a compaction is run to write the data in memory to disk. During the update of the Tablet's metadata to record the new file in HDFS, the TabletServer will set the last location, since there is now locality to consider.

{current=server1:port, future=null, last=server1:port}

TabletStateStores

A layer of abstraction which is relevant for assignment is the TabletStateStore. So far, we have only dealt with the assignment of user tables, which ignores the issue of how to bring the metadata table and root table online. Consider each of these three levels of Tablets. Each horizontal bar corresponds to a Store of Tablets that need to be managed. Reading top down, there is a dependency that all Tablets in the Store above the current Store are assigned. Concretely, before user table tablets can be assigned, the metadata tablets must be assigned; likewise, before metadata table tablets can be assigned, the root table tablet must be assigned. This is not explicitly enforced, because the necessary read operations will block while the previous level is unassigned. The same is not true for unassigning tablets: unassignment must be done bottom-up to ensure that the necessary information can be persisted before a Tablet is taken offline. The locations of metadata table tablets and user table tablets are stored in Accumulo as normal tables (in the root table and the metadata table, respectively), while the information to locate the root table's tablet is stored in ZooKeeper to bootstrap the system.
As such, the same assignment logic can be reused across all three of these Stores simply by changing where the state is read from: an Accumulo table (the metadata table for user tablets, the root table for metadata tablets) or ZooKeeper (for the root tablet).
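That reuse can be sketched as one loop over several stores. The interface, class, and method names below are invented for illustration and do not match Accumulo's real API:

```python
# Hypothetical sketch of the TabletStateStore layering: one assignment
# loop, three backing stores. Names are illustrative, not Accumulo's API.

class TabletStateStore:
    """One level of the hierarchy: where tablet location state lives."""
    def unhosted_tablets(self):
        raise NotImplementedError

class ZooKeeperStore(TabletStateStore):
    # Bootstraps the system: holds only the root tablet's location.
    def __init__(self, zk_data):
        self.zk_data = zk_data
    def unhosted_tablets(self):
        return [] if self.zk_data.get("root_location") else ["root"]

class MetaDataStore(TabletStateStore):
    # Reads tablet rows from an Accumulo table: the root table for
    # metadata tablets, the metadata table for user tablets.
    def __init__(self, rows):
        self.rows = rows
    def unhosted_tablets(self):
        return [t for t, r in self.rows.items() if r["current"] is None]

# Assignment considers stores top-down: root, then metadata, then user.
stores = [
    ZooKeeperStore({"root_location": "server1:port"}),
    MetaDataStore({"meta_tablet1": {"current": "server2:port"}}),
    MetaDataStore({"user_tablet1": {"current": None}}),
]
assert [s.unhosted_tablets() for s in stores] == [[], [], ["user_tablet1"]]
```

The point of the abstraction is that the scan-classify-assign loop never needs to know which level it is operating on; only the store's read and write paths differ.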
Automatic Error Handling/Fixing

One other task that the Master performs with respect to assignment is sanity checking the current state of the Tablet entries. I believe many of these error checks have accumulated over years of finding a bug, diagnosing how the bug was caused, and then adding fixes that not only prevent the bug from happening again, but also recognize the bad state if it ever recurs and try to automatically recover from it. Many of these error conditions are recoverable, although some checks are for very serious problems (e.g. multiple assignment) and only provide a big warning message. I believe many of these checks and fixes are also related to the splitting and merging of tablets, and the failure (with pending retry) of those operations. The Master can attempt to determine, based on current state (such as the set of active TabletServers), how to fix an issue like a Tablet having both a future location and a current location (the future location should be erased when the current is set), or remove a second current location when it is on a dead TabletServer.

Optimizations and novel details

Server-side filtering when reading the metadata table: As previously stated, the Master regularly scans the metadata table looking for Tablets which are not in the hosted state. On a system with a large number of tablets, this can be a large amount of data to bring back to a single process (the Master). As such, we can push down a custom server-side filter (via an Accumulo iterator) that will only return Tablet records (whole rows) that do not meet the criteria for being hosted. This reduces the amount of computation that the Master needs to perform, in addition to parallelizing the work across multiple servers (ordering of Tablets to bring online within a Store is not necessary).

Updates to the metadata table are distributed: The Master doesn't have to coordinate all of the updates to the metadata table, but can leave it to the TabletServer to perform the update.
With the ability to split the metadata table (giving it multiple tablets), both reads and writes can be handled without becoming bottlenecked on a single server. This helps Accumulo scale beyond millions of tablets. Operations such as assignment can become limited by the speed at which an update to the metadata table can be made; however, this is a worthwhile optimization to pursue as necessary, since it would likely also improve the normal user write path.

Proactive messages from TabletServer to Master: As mentioned earlier, the Master sends one-way (void) messages to TabletServers to avoid blocking RPCs. While the Master will eventually see all changes when it next reads the metadata table, the TabletServer will also send a message back to the Master with the action it just took. For example, after a Tablet is brought online, the TabletServer will send a message to the Master informing it that this happened. This update can wake the Master from a sleeping state to respond more quickly to changes in the system, but if these messages are delayed or dropped, it's not a concern, since we have durability in the metadata table.
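As a closing illustration, the self-healing rules described under Automatic Error Handling/Fixing can be sketched as a repair pass over a metadata row. The two rules shown come from the text above; the code shape and names are hypothetical:

```python
# Hypothetical sketch of the Master's metadata sanity-check pass.
# The two repair rules are described in the text; names are illustrative.

def repair(row, live_servers):
    """Return a repaired copy of a Tablet's metadata row."""
    fixed = dict(row)
    # The future location should have been erased once current was set.
    if fixed["current"] is not None and fixed["future"] is not None:
        fixed["future"] = None
    # A current location on a dead TabletServer is cleared so the
    # Tablet becomes eligible for reassignment.
    if fixed["current"] is not None and fixed["current"] not in live_servers:
        fixed["current"] = None
    return fixed

live = {"server2:port"}
# Stale future left behind after a completed assignment:
assert repair({"current": "server2:port", "future": "server2:port",
               "last": None}, live)["future"] is None
# Current location pointing at a dead server:
assert repair({"current": "server1:port", "future": None,
               "last": None}, live)["current"] is None
```

Because each rule only inspects the row plus the live-server set, the pass can run as part of the same periodic metadata scan that drives assignment.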