Apache HBase: crazy dances on the elephant's back. Roman Nikitchenko, 16.10.2014
YARN
THE FIRST EVER DATA OS: a 10,000-node computer. Recent technology changes focus on higher scale: better resource usage and control, lower MTTR, higher security, redundancy and fault tolerance.
Hadoop is an open source framework for big data, covering both distributed storage and processing. Hadoop is reliable and fault tolerant without relying on hardware for these properties. Hadoop offers unique horizontal scalability: currently from a single computer up to clusters of thousands of nodes.
What is HADOOP indeed? Why Hadoop? (Diagram: BIG DATA at maximum scale.)
HBase motivation. Beware: Hadoop is designed for throughput, not for latency. HDFS blocks are expected to be large, and there is a known issue with lots of small files. The ideology is write once, read many times. MapReduce is not very flexible, and neither is any database built on top of it. What about realtime?
HBase motivation. But we often need low latency and speed together with all the Hadoop properties.
Agenda. HBASE as is: architecture, data model, features. Something special: but we are all special, aren't we? INTEGRATION: it's not only about HBase.
MANIFEST. An open source implementation of Google BigTable, in its proper place in the infrastructure. Limited but strict ACID guarantees. Realtime, low latency, linear scalability. Distributed, reliable and fault tolerant. Natural integration with the Hadoop infrastructure. Really good for massive scans. Server-side user operations. No SQL at all. Secondary indexing is pretty complex.
(Layer diagram: high-level applications on top; resource management (YARN) in the middle; the distributed file system below.)
KEY USERS
HBase: the story begins. 2006: the Google BigTable paper is published; HBase development starts. 2007: first code is released as part of Hadoop 0.15; the focus is on offline crawl data storage. 2008: HBase goes OLTP (online transaction processing); 0.20 is the first performance release. 2010: HBase becomes an Apache top-level project. November 2010: Facebook selects HBase to implement its new messaging platform. HBase 0.92 is considered the production-ready release.
HBase: it is NoSQL. Loose data structure, for example: Book (kind, price, title, author, pages), Ball (kind, price, color, size, material), Toy car (price, color, type, radio control). Data looks like tables with a large number of columns, but the column set can vary from row to row, and no table modification is needed to add a column to a row.
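A minimal sketch of this loose, per-row column set using plain Java collections (not the HBase API; class and key names are made up for illustration):

```java
import java.util.*;

public class SparseRows {
    // Each row maps column name -> value; rows need not share columns.
    static final Map<String, Map<String, String>> table = new TreeMap<>();

    static void put(String rowKey, String column, String value) {
        table.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    public static void main(String[] args) {
        put("book#1", "kind", "book");
        put("book#1", "title", "Some Title");
        put("ball#1", "kind", "ball");
        put("ball#1", "color", "red");   // "color" exists only for this row
        System.out.println(table.get("book#1").keySet()); // columns differ per row
        System.out.println(table.get("ball#1").keySet());
    }
}
```

Adding a new column to a single row is just another `put`; no shared schema is touched.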
Logical data model. Data is placed in tables; every row consists of columns. Every table row is identified by a unique row key. Columns are grouped into families. Tables are split into regions based on row key ranges.
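How a row key resolves to a region under range-based splitting can be sketched with a sorted map (illustrative only; region names and boundaries here are invented, and the real client caches META lookups):

```java
import java.util.*;

public class RegionLookup {
    // Region start key -> region name; a region covers [startKey, nextStartKey).
    static final TreeMap<String, String> regions = new TreeMap<>();

    static String regionFor(String rowKey) {
        // The greatest start key <= rowKey owns the row.
        return regions.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        regions.put("", "region-1");   // first region starts at the empty key
        regions.put("g", "region-2");
        regions.put("p", "region-3");
        System.out.println(regionFor("book#1"));  // region-1
        System.out.println(regionFor("hat#7"));   // region-2
    }
}
```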
Real data model. Data is stored in HFiles; each column family is stored on disk in a separate file. Row keys are indexed in memory. A stored column cell includes key, qualifier, value and timestamp; there is no column limit. Storage is block based. A delete is just another marker record, so periodic compaction is required.
HBase: infrastructure view. ZooKeeper coordinates the distributed elements and is the primary contact point for clients. The Master server keeps metadata (META) and manages data distribution over the region servers. Region servers manage table regions. Clients locate the master through ZooKeeper, then locate the needed regions through the master, and communicate directly with region servers for data.
Together with HDFS. The same picture with HDFS underneath: the Master works alongside the HDFS NameNode, and region servers (RS) sit with HDFS data nodes (DN) across the racks. Clients still locate the master through ZooKeeper, find regions through the master, and talk directly to region servers; the actual data storage service, including replication, is provided by the HDFS data nodes.
KEY OPERATIONS. PUT: no difference whether we add data or replace existing data. GET: get a data element by key (rows, columns). SCAN: a massive GET over a key range. DELETE: delete a single object. Batch operations are possible.
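The key-addressed nature of the four operations can be mimicked with a sorted map (a stdlib sketch, not the HBase client API; in the real client these are the Put, Get, Scan and Delete request classes):

```java
import java.util.*;

public class KeyOps {
    static final TreeMap<String, String> table = new TreeMap<>();

    static void put(String key, String value) { table.put(key, value); } // add or replace
    static String get(String key)             { return table.get(key); }
    static void delete(String key)            { table.remove(key); }

    // Massive GET over the key range [from, to).
    static SortedMap<String, String> scan(String from, String to) {
        return table.subMap(from, to);
    }

    public static void main(String[] args) {
        put("row1", "a");
        put("row1", "b");                  // PUT silently replaces
        put("row2", "c");
        put("zzz",  "d");
        System.out.println(get("row1"));                 // b
        System.out.println(scan("row", "rox").keySet()); // [row1, row2]
        delete("row2");
        System.out.println(get("row2"));                 // null
    }
}
```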
CLOSER VIEW
The actual write goes to a region server; the master is not involved. All requests first go to the WAL (write-ahead log) to provide recovery. The region server keeps a MemStore as temporary in-memory storage; only when needed is the write flushed to disk (into an HFile).
WHY IS IT FAST? CRUD: Put and Delete. Memory is used intensively: writes are logged and cached in memory, reads are just cached. The lower layer is an append-only filesystem (HDFS), so the PUT and DELETE paths are identical: DELETE just adds another marker. Both PUT and DELETE requests are per row key; there is no row key range for DELETE. The actual delete is performed during compactions.
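A toy illustration of delete markers over an append-only log, and of compaction dropping them (an assumed simplification of the real MemStore/HFile/compaction machinery):

```java
import java.util.*;

public class DeleteMarkers {
    record Entry(String key, String value, boolean deleteMarker) {}

    static final List<Entry> log = new ArrayList<>();  // append-only, like flushed HFiles

    static void put(String key, String value) { log.add(new Entry(key, value, false)); }
    static void delete(String key)            { log.add(new Entry(key, null, true)); }

    // Reads scan newest-first; a delete marker hides older versions.
    static String get(String key) {
        for (int i = log.size() - 1; i >= 0; i--) {
            Entry e = log.get(i);
            if (e.key().equals(key)) return e.deleteMarker() ? null : e.value();
        }
        return null;
    }

    // Compaction rewrites the log, dropping masked versions and the markers themselves.
    static void compact() {
        Map<String, Entry> latest = new LinkedHashMap<>();
        for (Entry e : log) latest.put(e.key(), e);
        log.clear();
        for (Entry e : latest.values()) if (!e.deleteMarker()) log.add(e);
    }

    public static void main(String[] args) {
        put("row1", "a"); put("row2", "b"); delete("row1");
        System.out.println(get("row1"));  // null: masked by the marker
        compact();
        System.out.println(log.size());   // 1: only row2 survives
    }
}
```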
CRUD: Get and Scan. Get is a simple data request by row key and is implemented through Scan. Scan is performed over a row key range, which may involve several table regions. Both Get and Scan can include filter expressions that are processed on the server side and can seriously limit the results, and therefore the traffic. Both operations can span several column families.
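The point of server-side filters is that rows are dropped before they are shipped to the client. A sketch of a range scan with a filter predicate (stdlib only; the real API uses Scan plus Filter subclasses):

```java
import java.util.*;
import java.util.function.Predicate;

public class FilteredScan {
    static final TreeMap<String, Integer> table = new TreeMap<>();

    // Range scan with a server-side-style filter: filtered rows never
    // reach the result, mimicking reduced client traffic.
    static List<String> scan(String from, String to, Predicate<Integer> filter) {
        List<String> result = new ArrayList<>();
        for (var e : table.subMap(from, to).entrySet())
            if (filter.test(e.getValue())) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        table.put("row1", 5); table.put("row2", 50); table.put("row3", 7);
        System.out.println(scan("row1", "row9", v -> v < 10)); // [row1, row3]
    }
}
```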
SERVER SIDE TRICKS. Coprocessors are a feature that allows extending HBase without modifying the product code. A RegionObserver can attach code to operations at the region level; similar functionality exists for the Master. Endpoints are the way to provide functionality equivalent to stored procedures. Together, the coprocessor infrastructure can provide a realtime distributed processing framework (a lightweight MapReduce).
Coprocessors: Region observer. A region observer works like a hook on region operations, and region observers can be stacked. (Diagram: a client request passes through the stacked region observers of each region on the region servers before the result returns to the client.)
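The hook-and-stack idea in miniature (plain Java, not the real coprocessor API; the real RegionObserver exposes many pre/post hooks such as prePut):

```java
import java.util.*;
import java.util.function.UnaryOperator;

public class ObserverStack {
    // Stacked "pre-put" hooks: each observer may inspect or rewrite the value.
    static final List<UnaryOperator<String>> observers = new ArrayList<>();
    static final Map<String, String> region = new HashMap<>();

    static void put(String key, String value) {
        for (UnaryOperator<String> hook : observers)  // run hooks in stack order
            value = hook.apply(value);
        region.put(key, value);
    }

    public static void main(String[] args) {
        observers.add(String::toUpperCase);           // observer #1
        observers.add(v -> v + "!");                  // observer #2
        put("row1", "hello");
        System.out.println(region.get("row1"));       // HELLO!
    }
}
```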
Coprocessors: Endpoints. Direct communication via a separate protocol: the client sends an RPC request to an endpoint on each region server and receives a response. Your commands can take effect on the table regions.
WHY IS SERVER SIDE BLACK MAGIC? You are effectively modifying region server or master code, and any mistake leads to hell. The Java class loader requires a service restart on reload, so any modification leads to hell too.
INTEGRATION: integration with MapReduce.
MAP + REDUCE + HBASE: integration with MapReduce. HBase provides a number of classes for native MapReduce integration; the main point is data locality. TableInputFormat allows massive MapReduce processing of a table, mapping it with one region per mapper. HBase classes like Result (a Get/Scan result) or Put (a Put request) can be passed between MapReduce job stages. There is not much difference between MR1 and YARN here. (Diagram: HMaster, JobTracker and NameNode on the master side; RegionServer, TaskTracker and DataNode often share a single node, so data is local.)
MAPREDUCE CLASSICS. HBase table data is mapped with one mapper per table region, so mapped data is processed locally. After the local (!) map phase the data is reduced; the reduce can be non-local processing, but it is much lighter. So we get almost 100% distributed, local data processing across the Hadoop cluster. (Diagram: each table region feeds its own mapper; the mappers feed the reducers.)
BULK LOAD. There is a way to load data into a table MUCH FASTER: HBase internal storage files (HFiles) are prepared directly, preferably one HFile per table region, and MapReduce can be used for this. The prepared HFiles are then merged with the table storage at maximum speed. (Diagram: data importers feed the mappers; the reducers act as HFile generators, one HFile per table region.)
SECONDARY INDEX THROUGH COPROCESSORS. HBase has no secondary indexing out of the box. A coprocessor (RegionObserver) is used to track Put and Delete operations and update an index table. Scan operations with a filter on the indexed column are intercepted and answered from the index table content. (Diagram: client Put/Delete and filtered Scan requests hit the region; the region observer performs the index update and the index search against the index table.)
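A toy version of the observer-maintained index table (an assumed simplification: a real implementation must also handle Delete operations, versions and concurrent updates; "color" is an invented indexed column):

```java
import java.util.*;

public class SecondaryIndex {
    static final Map<String, String> data = new HashMap<>();        // rowKey -> color
    static final Map<String, Set<String>> index = new HashMap<>();  // color -> rowKeys

    // The "region observer": every put also updates the index table.
    static void put(String rowKey, String color) {
        String old = data.put(rowKey, color);
        if (old != null) index.get(old).remove(rowKey);  // drop the stale index entry
        index.computeIfAbsent(color, c -> new TreeSet<>()).add(rowKey);
    }

    // An indexed "scan with filter": answered from the index, not a full scan.
    static Set<String> findByColor(String color) {
        return index.getOrDefault(color, Set.of());
    }

    public static void main(String[] args) {
        put("ball#1", "red");
        put("car#1", "red");
        put("ball#1", "blue");                    // update moves the index entry
        System.out.println(findByColor("red"));   // [car#1]
        System.out.println(findByColor("blue"));  // [ball#1]
    }
}
```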
INDEX ALTERNATIVE: SOLR. SOLR indexes documents: an index update request is analyzed, tokenized and transformed, and the same happens to queries. What is stored in the SOLR index is not what you index, and a search result is a document ID. SOLR is NOT A STORAGE, ONLY AN INDEX, but it can index ANYTHING.
HBase handles online user data change requests. The NGData Lily indexer handles the stream of changes and transforms them into SOLR index change requests. The indexes are built in SOLR, so HBase data becomes searchable.
HBase: data and search integration. The user just puts (or deletes) data; the HBase cluster handles the update. Replication, which can be configured down to the column family level, feeds the Lily HBase NRT indexer, which translates data changes into SOLR index updates. Apache ZooKeeper does all the coordination, HDFS serves as the low-level file system, and the SOLR cloud finally answers search requests (HTTP) with search responses.
Questions and discussion