
HBase Terminology Translation Legend:
tablet == region
Accumulo Master == HBase HMaster
Metadata table == META
TabletServer (or tserver) == RegionServer

Assignment is driven solely by the Master process. Assignment can be thought of as a state machine driven by the contents of the metadata table; the Master keeps only some transient information in memory. ZooKeeper is used only for liveness checks on a TabletServer (ZooKeeper is checked by the Master, but also by the TabletServers themselves; details follow later). As such, the metadata table must always be in a consistent state: either a state that the Master understands how to transition from, or (worst case scenario) one for which a reasonable fix can be made. Consistency of the metadata table (and of the updates written to it before actions are taken) is very important for assignment to work as intended. Lost updates to the metadata table would almost certainly lead to multiple-assignment and data-loss bugs.

Some quick definitions of the Tablet states in this state machine:

Unassigned: Not online and not scheduled to be assigned anywhere.
Assigned: Not online, but scheduled to be assigned somewhere.
Hosted: Assigned to a server, and that server has brought the Tablet online (the desired state).
Assigned to dead server: The metadata table records that a Tablet is hosted, but the Master has noticed that the TabletServer which should be hosting it is dead.

While the metadata table contains other information, for the purposes here let's just assume that it only contains information about tablets. Each row in the metadata table defines a tablet. For the purposes of assignment, each row contains columns for the following: current location, future location, and last location.

Last Location: The last location is used for preserving data locality. When assigning a tablet, the Master will observe the last location column and attempt to assign the tablet back to that location. Not much more to say. The TabletServer updates this column after a compaction.

Future Location: The future location marks that the Master wants to assign a given Tablet to a given server. A tablet that is unassigned first has its future location set, which later triggers the Master to tell the TabletServer to bring the tablet online. This also helps with fault tolerance in the Master. For example, consider the Master failing during assignment: having calculated where a Tablet should be assigned but being restarted before completing the assignment, the Master can, when it comes back, still assign the Tablet to that server. In the negative case, where the TabletServer is no longer alive, it's a simple state transition to unset the future location and let the process happen again.

Current Location: The current location stores where a tablet is currently assigned. This is updated by the TabletServer during the final phase of assignment, not by the Master, and it is the last step before a Tablet is considered hosted. This column is also updated when the Master notices that the server listed in this value for a Tablet is no longer alive: the Master clears the current location as part of that transition.
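To make the state machine concrete, here is a minimal, self-contained Java sketch of how a Tablet's state could be derived from these three columns. The class, field, and enum names are illustrative stand-ins rather than Accumulo's actual internal types; liveness of a server is assumed to come from the Master's view of ZooKeeper locks.

import java.util.Set;

/** Illustrative model of the assignment-related columns in one metadata row. */
class TabletMetadata {
    String current; // current location ("host:port"), or null
    String future;  // future location, or null
    String last;    // last location, or null
}

enum TabletState { UNASSIGNED, ASSIGNED, HOSTED, ASSIGNED_TO_DEAD_SERVER }

class TabletStates {
    /**
     * Derive a Tablet's state from its metadata row, given the set of
     * TabletServers the Master currently believes are alive (per ZooKeeper).
     */
    static TabletState stateOf(TabletMetadata row, Set<String> liveServers) {
        if (row.current != null) {
            return liveServers.contains(row.current)
                    ? TabletState.HOSTED
                    : TabletState.ASSIGNED_TO_DEAD_SERVER;
        }
        if (row.future != null) {
            return TabletState.ASSIGNED;   // scheduled, but not yet online
        }
        return TabletState.UNASSIGNED;     // not online, not scheduled
    }
}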

General Assignment Loop

Let's outline the simplest case for assignment. Consider a single Tablet which is currently offline; let's say it's for a table that was just created. Its relevant assignment state would be as follows:

{current=null, future=null, last=null}

The Master scans the metadata table periodically, looking for Tablets which are not hosted. Because the above Tablet has no value for current, we know that it is not hosted. Because there is no value for last, there is no locality to preserve and any available TabletServer can be chosen. The Master will take the set of active TabletServers in the cluster (based on ZooKeeper), choose a TabletServer, and record that server's information in the value for future:

{current=null, future=server1:port, last=null}

After setting the future value, the Master will inform server1:port that it should assign the Tablet. This is a one-way Thrift call, a fire-and-forget message: the remote end of the RPC cannot send a response back to the caller. This lets the Master tell a TabletServer to bring a Tablet online without requiring the Master to block on the RPC while the TabletServer actually performs the action.

The TabletServer will (eventually) see the request from the Master to bring this Tablet online. After performing some precondition-like checks, the TabletServer will make the necessary updates in its own memory to host this Tablet and then write an update to the metadata table that sets the current column and unsets the future column. Writes by the TabletServer are only allowed after a cached check of its ZooKeeper lock. This helps ensure that we don't have a zombie server trying to host tablets due to delayed RPCs from the Master, but doesn't need to be a synced ZooKeeper read. In the worst case scenario where a TabletServer loses its lock, it shuts itself down quickly and the tablets hosted there move back into a state in which they can be reassigned.
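A rough sketch of these two transitions, continuing the illustrative model above. The names are hypothetical and this glosses over the real Accumulo client API; the point is only which side writes which column, and that the TabletServer checks its ZooKeeper lock before writing.

import java.util.Set;

class AssignmentSteps {
    /** Master side: schedule an unassigned Tablet by setting its future location. */
    static void scheduleAssignment(TabletMetadata row, Set<String> liveServers) {
        if (row.current != null || row.future != null) {
            return; // already hosted or already scheduled
        }
        if (liveServers.isEmpty()) {
            return; // no server to assign to yet
        }
        // Prefer the last location for locality, if that server is still alive.
        String chosen = (row.last != null && liveServers.contains(row.last))
                ? row.last
                : liveServers.iterator().next(); // otherwise any live server
        row.future = chosen;
        // ... then send the one-way "load this tablet" RPC to the chosen server ...
    }

    /**
     * TabletServer side: final phase of assignment. Only performed after a
     * (cached) check that this server still holds its ZooKeeper lock.
     */
    static void recordHosted(TabletMetadata row, String self, boolean stillHoldsZkLock) {
        if (!stillHoldsZkLock) {
            return; // zombie server: must not write to the metadata table
        }
        row.current = self;  // the Tablet is now hosted here
        row.future = null;   // clear the scheduling marker
    }
}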

These metadata updates let the Master know that the Tablet has moved from the assigned state into the hosted state. Hooray.

{current=server1:port, future=null, last=null}

Later on, say a user writes some data to this Tablet and a compaction is run to write the in-memory data to disk. When the Tablet's metadata is updated to record the new file in HDFS, the TabletServer also sets the last location, since there is now locality to consider:

{current=server1:port, future=null, last=server1:port}

TabletStateStores

A layer of abstraction relevant for assignment is the TabletStateStore. So far we have only dealt with the assignment of user tables, which ignores how the metadata table and the root table are themselves brought online. Consider these three levels of Tablets: each level corresponds to a Store of Tablets that needs to be managed, and reading top-down implies a dependency that all Tablets in the Store above the current Store are assigned. Concretely, before user-table tablets can be assigned, the metadata tablets must be assigned; likewise, before metadata-table tablets can be assigned, the root tablet must be assigned. This is not explicitly enforced, because the necessary read operations will simply block while the previous level is unassigned. The same is not true for unassigning tablets: unassigning tablets safely must be done bottom-up, to ensure that the necessary information can be persisted before a Tablet is taken offline.

The locations of metadata-table tablets and user-table tablets are stored in Accumulo itself as normal tables (in the root table and the metadata table, respectively), while the information needed to locate the root tablet is stored in ZooKeeper to bootstrap the system. As such, the same assignment logic can be reused across all three of these Stores simply by changing where the state is read from: the root table, the metadata table, or ZooKeeper (for the root tablet's assignment).
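A minimal sketch of what such a layered store abstraction could look like, still using the illustrative model from above. This interface and its method names are hypothetical and are not the actual Accumulo TabletStateStore API; the idea is only that the same Master logic runs against each level.

import java.util.List;

/**
 * Illustrative abstraction over where assignment state lives. The same Master
 * assignment logic runs against each implementation, top-down:
 *   a ZooKeeper-backed store for the root tablet,
 *   a store backed by the root table for the metadata tablets,
 *   a store backed by the metadata table for user-table tablets.
 */
interface TabletStateStore {
    /** Rows describing tablets that are not currently hosted. */
    List<TabletMetadata> unhostedTablets();

    /** Master side: record the intended (future) location for a tablet. */
    void setFutureLocation(String tabletId, String server);

    /** Master side: clear a location recorded against a now-dead server. */
    void unassign(String tabletId);
}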

Automatic Error Handling/Fixing

One other task that the Master performs with respect to assignment is sanity checking the current state of the Tablet entries. I believe many of these error checks have come about after years of finding a bug, diagnosing how the bug was caused, and then adding fixes that not only prevent the bug from happening again but also recognize if that state ever recurs and try to automatically recover from it. Many of these error conditions are recoverable, although some are checks for very serious problems (e.g. multiple assignment) that just provide a big warning message. I believe many of these checks and fixes are also related to splitting and merging of tablets, and to the failure (with pending retry) of those operations. The Master can attempt to determine, based on current state (such as the set of active TabletServers), how to fix an issue like a Tablet having both a future location and a current location (the future location should be erased once the current is set), or how to remove a second current location when it points at a dead TabletServer.

Optimization or novel details

Server-side filter when reading the metadata table: As previously stated, the Master regularly scans the metadata table looking for Tablets which are not in the hosted state. On a system with a large number of tablets, this can be a large amount of data to bring back to a single process (the Master). As such, a custom server-side filter (via an Accumulo iterator) can be pushed down that returns only the Tablet records (whole rows) that do not meet the criteria for being hosted. This reduces the computation the Master needs to perform, in addition to parallelizing the work across multiple servers (the ordering in which Tablets within a Store are brought online does not matter).
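A rough sketch of what such a pushed-down filter could look like, written against Accumulo's RowFilter iterator base class. The column family names here are illustrative stand-ins rather than the exact metadata schema, and dead-server detection (which needs the Master's view of ZooKeeper) would still happen on the Master side.

import java.io.IOException;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.user.RowFilter;

/**
 * Passes through only the rows of tablets that are not plainly hosted:
 * any row that has a future location set, or no current location at all.
 */
public class UnhostedTabletFilter extends RowFilter {

    @Override
    public boolean acceptRow(SortedKeyValueIterator<Key, Value> row) throws IOException {
        boolean hasCurrent = false;
        boolean hasFuture = false;
        while (row.hasTop()) {
            String family = row.getTopKey().getColumnFamily().toString();
            if (family.equals("loc")) {        // stand-in for the current-location family
                hasCurrent = true;
            } else if (family.equals("future")) { // stand-in for the future-location family
                hasFuture = true;
            }
            row.next();
        }
        // Keep the row (send it back to the Master) unless it looks hosted.
        return hasFuture || !hasCurrent;
    }
}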

Updates to the metadata table are distributed: The Master does not have to coordinate all of the updates to the metadata table; it can leave it to the TabletServers to perform their own updates. With the ability to split the metadata table itself (so that it has multiple tablets), both reads and writes can be handled without becoming bottlenecked on a single server. This helps Accumulo scale beyond millions of tablets. Operations such as assignment can still become limited by the speed at which an update to the metadata table can be made; however, this is a worthwhile optimization to pursue as necessary, since it would likely also improve the normal user write path.

Proactive messages from TabletServer to Master: As was mentioned earlier, the Master sends one-way (void) messages to TabletServers to avoid blocking RPCs. While the Master will eventually see all changes the next time it reads the metadata table, the TabletServer will also send a message back to the Master reporting what action it just took. For example, after a Tablet is brought online, the TabletServer will send a message to the Master informing it that this happened. This update can wake the Master from a sleeping state so it responds more quickly to changes in the system; if these messages are delayed or dropped, it is not a concern, since we still have the durability of the metadata table to fall back on.
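The interplay between the periodic scan and these best-effort notifications can be sketched with a simple wait/notify loop. This is only meant to illustrate why dropped notifications are harmless (the timed scan always happens anyway); it is not how the Master is actually implemented, and the names are hypothetical.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative Master loop: scan periodically, but wake early when poked. */
class AssignmentLoop {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition poked = lock.newCondition();

    /** Called when a TabletServer reports an action (best effort, may be lost). */
    void nudge() {
        lock.lock();
        try {
            poked.signal();
        } finally {
            lock.unlock();
        }
    }

    void run() throws InterruptedException {
        while (true) {
            scanMetadataAndAssign();
            lock.lock();
            try {
                // Sleep until the next periodic scan, or until nudged.
                poked.await(60, TimeUnit.SECONDS);
            } finally {
                lock.unlock();
            }
        }
    }

    private void scanMetadataAndAssign() {
        // read the TabletStateStores, schedule assignments, fix inconsistencies
    }
}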