Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase




Who am I? Distributed systems engineer. Principal Architect in the Big Data Platform Engineering Group at Intel. Background: distributed systems software, operating systems (networking, device drivers), high-speed networking, compiler design and implementation. Apache OSS guy: long-time open source participant and advocate, early Hadoop adopter (2007), committer and PMC member on the Apache HBase project since 2008.

Real time"? "Of or relating to computer systems that update information at the same rate as they receive data" (Free Dictionary) The classic definition - Millisecond or smaller In the context of big data, this has limited applicability

Real time"? "Denoting or relating to a data-processing system in which a computer receives constantly changing data [ ] and processes it sufficiently rapidly to be able to control the source of the data" (Free Dictionary) This is closer Seconds? Minutes? Hours? On what timescale can your business processes control the sources of data?

So near-real-time, then? "Of or relating to computer systems that update information with a time delay introduced, by automated data processing or network transmission, between the occurrence of an event and the use of the processed data. [...] The delay in near real-time can be as high as 15 to 20 minutes in some applications." (Wikipedia). Closer to on-the-ground conditions in business. We are not talking about direct control loops; humans make the decisions.

Real-time data analysis "Analysis available at the instant we need it" (Your stakeholders) Is this about data processing or data exploration?

Real-time data analysis: only a small subset of applications require truly real-time (i.e. stream-based) processing. Sometimes: when extremely fresh data is required for decisions. Mostly: when there are control loops that do not include humans.

Real-time data analysis: analysis of data can lead to new insights. Insights can save costs and grow revenue. Faster analysis can generate greater revenue by providing an advantage over slower competitors. But that analysis is still a human process: the analyst's query leads to a decision. "Real-time" is what business analysts and executives often ask for, but what they often actually want is something else...

Interactive Data Exploration Image credit: chartio.com

Interactive Data Exploration
  Front end: JavaScript visualization libraries (e.g. D3); SQL + JDBC or ODBC driver
  Middle tier: web server, application server, analytic query engine
  Back end: data ingest, data store, compute fabric

The 3Vs of Big Data What makes Big Data big? Volume Variety Velocity

Apache Hadoop and the 3Vs: Apache Hadoop provides individuals and businesses a cost-effective and highly flexible platform for ingest and processing of a high volume of data with a high degree of variety. And velocity?

Apache Hadoop and the 3Vs: Apache Hadoop alone is not a good choice as the engine for interactive data exploration. The velocity of an answer coming back from a query is constrained by an impedance mismatch: Apache Hive is the de facto interface to Hadoop, and its engine is MapReduce*. MapReduce is optimized for fault tolerance and throughput, not latency. (* Efforts underway to improve here have not yet delivered production-quality software.)

Apache HBase and V[3]*: Application architectures including Apache HBase can give business processes a higher velocity. The velocity of an HBase query is orders of magnitude faster than MapReduce: millisecond latencies for point queries, optimized short scanning for on-demand rollups. A great persistence option for interactive exploration of data sets, especially if you need on-demand rollups. (* Yes, this is an off-by-one error.)

Why Apache HBase? Versus the RDBMS: operations are simple and complete quickly; no black-box optimizer, so performance is predictable. Versus other NoSQL: provides strongly consistent semantics; automatically shards and scales (just turn on more servers); strong security: authorization, ACLs, encryption*; first-class Hadoop integration: HDFS storage, MapReduce input/output formats, bulk load. (* Under development: HBASE-7544)

The HBase System Model [Diagram: a ZooKeeper ensemble coordinates an active HBase Master and a standby Master; HBase RegionServers run co-located with HDFS DataNodes on each server, over the Hadoop Distributed File System (HDFS); the cluster scales horizontally by adding servers.]

The HBase Data Model: not a spreadsheet. Think of tags: values of any length, with no predefined names or widths.
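To make the tag-like data model concrete, here is a minimal sketch using the HBase Java client (the 1.x Connection/Table API, which postdates this talk); the "metrics" table, the "d" column family, and the row key layout are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DataModelSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("metrics"))) {
          // Column qualifiers behave like tags: created on write, per row,
          // with no predefined schema of names or widths.
          Put put = new Put(Bytes.toBytes("sensor42#20140601"));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("site"), Bytes.toBytes("SJC"));
          table.put(put);

          // Point read: a single-row Get completes with millisecond-class latency.
          Result r = table.get(new Get(Bytes.toBytes("sensor42#20140601")));
          System.out.println(Bytes.toString(
              r.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"))));
        }
      }
    }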

Application Pattern 1: Traditional HBase application configuration
  Ingest: [HDFS, HBase]
  Materialization: HDFS -> MapReduce -> HBase
  Computation: HBase -> MapReduce -> HBase
  Update: User -> Application server -> HBase
  Query: HBase -> Application server -> User
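A minimal sketch of the materialization leg (HDFS -> MapReduce -> HBase) under this pattern, wiring a map-only job to HBase output via TableMapReduceUtil, as shown in the HBase reference guide; the input format, the tab-separated field layout, and the "rollups" table are assumptions for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class MaterializeJob {
      // Map-only job: each input line becomes one Put keyed by its first field.
      static class LineToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws java.io.IOException, InterruptedException {
          String[] fields = value.toString().split("\t", 2);
          if (fields.length < 2) return;
          Put put = new Put(Bytes.toBytes(fields[0]));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("raw"), Bytes.toBytes(fields[1]));
          ctx.write(new ImmutableBytesWritable(put.getRow()), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "materialize-to-hbase");
        job.setJarByClass(MaterializeJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setMapperClass(LineToPutMapper.class);
        // Wires TableOutputFormat and the HBase dependencies; no reducer needed here.
        TableMapReduceUtil.initTableReducerJob("rollups", null, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }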

HBase Coprocessors: an in-process system extension framework. Observers (like triggers) and Endpoints (like stored procedures). System integrators can deploy application code that runs where the data resides: location aware, with automatic sharding, failover, and client request dispatch.
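As an illustration of the Observer flavor, here is a minimal sketch against the 0.98/1.x coprocessor API (which differs slightly from both earlier and later releases); the AuditRowGuard class and its "audit:" row-prefix policy are invented for the example:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AuditRowGuard extends BaseRegionObserver {
      // Runs inside the RegionServer, before each Put is applied to the region.
      @Override
      public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                         Put put, WALEdit edit, Durability durability) throws IOException {
        String row = Bytes.toString(put.getRow());
        if (row.startsWith("audit:")) {
          // Throwing here rejects the write, much like a BEFORE INSERT trigger.
          throw new IOException("rows with the audit: prefix are read-only");
        }
      }
    }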

Application Pattern 2: Hosted data-centric coprocessor application; application-specific API
  Ingest: [HDFS, HBase]
  Materialization: HDFS -> MapReduce -> HBase
  Computation: HBase -> HBase (via coprocessor) -> HBase
  Update: User -> Application server -> HBase
  Query: HBase -> Application server -> User

Salesforce Phoenix: SQL query engine deeply integrated with HBase, delivered as a JDBC driver. Targets low-latency queries over HBase data. Columns modeled as a multi-part row key and key values. Secondary indexes. The query engine transforms SQL into puts, deletes, and scans, using native HBase APIs instead of MapReduce. Brings the computation to the data: aggregates, inserts, and deletes data on the server using coprocessors.
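A minimal sketch of what the Phoenix JDBC surface looks like from an application; the ZooKeeper quorum host and the METRICS schema are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixSketch {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement st = conn.createStatement()) {
          st.executeUpdate("CREATE TABLE IF NOT EXISTS METRICS ("
              + " HOST VARCHAR NOT NULL, TS DATE NOT NULL, VAL DOUBLE"
              + " CONSTRAINT PK PRIMARY KEY (HOST, TS))");   // becomes a multi-part row key
          st.executeUpdate("UPSERT INTO METRICS VALUES ('web1', CURRENT_DATE(), 0.75)");
          conn.commit();                                     // Phoenix batches mutations per commit
          // The aggregate runs server-side in coprocessors, close to the data.
          ResultSet rs = st.executeQuery("SELECT HOST, AVG(VAL) FROM METRICS GROUP BY HOST");
          while (rs.next()) {
            System.out.println(rs.getString(1) + " " + rs.getDouble(2));
          }
        }
      }
    }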

Application Pattern 3: Applications interface with the Phoenix JDBC driver and familiar SQL; interactive exploration with Phoenix structured queries
  Ingest: [HDFS, Phoenix]
  Materialization: HDFS -> Phoenix (via MR) -> HBase
  Update: User -> Phoenix -> HBase
  Query: HBase -> Phoenix -> User

Dremel-style analytic engines. Google Dremel: interactive ad-hoc query system for analysis of read-only nested data. Combines multi-level executor trees with highly optimized and clever columnar data layouts. Claims aggregation queries over trillion-row tables in seconds. Apache Drill: Dremel-inspired but taking a broader approach; incubating Apache project, no release yet.

Application Pattern 4: Co-location of query servers with HBase RegionServers; fast updates directly to HBase; interactive exploration with an analytic query engine
  Ingest: [HDFS, HBase]
  Update: User -> HBase
  Query: [HDFS, HBase] -> Query engine -> User

In-memory analytics. Apache Spark: developed for in-memory iterative algorithms, for machine learning and interactive data mining use cases. Aggressive use of memory and local parallel communication patterns. Can additionally trade between accuracy and time. Avoiding MapReduce's materialization of intermediate data to disk produces speed-ups measured at 100x for some cases. Shark: an Apache Hive-compatible query engine for Spark demonstrating similar performance gains.
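A minimal sketch of Spark reading an HBase table through the Hadoop InputFormat path (Java API; no dedicated connector is assumed); the "events" table name is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkOverHBase {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("spark-over-hbase"));
        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "events");   // hypothetical table
        // Each HBase region becomes one partition; cache() keeps the rows in memory
        // so iterative passes reuse them instead of re-reading from the RegionServers.
        JavaPairRDD<ImmutableBytesWritable, Result> rows =
            sc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class).cache();
        System.out.println("rows: " + rows.count());
        sc.stop();
      }
    }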

Application Pattern 5: In-memory / iterative computation with Spark; fast updates directly to HBase; interactive exploration with Shark
  Ingest: [HDFS, HBase]
  Materialization: HDFS -> Spark -> HBase
  Computation: [HDFS, HBase] -> Spark -> HBase
  Update: User -> HBase
  Query: [HDFS, HBase] -> Spark -> Shark -> User

Streaming Computation. Streaming computation: instead of running queries over data, we run data over (continuous) queries. Streaming application architectures pair an event bus, for distributing data to the computation fabric, with the streaming computation framework. Event buses: Apache Flume, Apache Kafka. Computation frameworks: Apache Storm (incubating), Apache Samza (incubating). [Diagram: in the database query model, queries run over stored data to produce results; in streaming computation, data runs over standing queries to produce results.]
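For the event-bus half of the pairing, a minimal sketch of publishing events with the Kafka Java producer (the org.apache.kafka.clients.producer API, which is newer than the 0.8-era client current at the time of this talk); the broker address and the "events" topic are hypothetical:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EventPublisher {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
          // Keying by entity keeps all of an entity's events in one partition, in order.
          producer.send(new ProducerRecord<>("events", "user42", "{\"action\":\"click\"}"));
        }
      }
    }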

Streaming Computation. Local in-memory computation can happen in milliseconds, meeting the real-time definition. Input is an effectively infinite sequence of items. Limited available memory. Limited processing time budget.

Streaming Computation and HBase. HBase is a good persistence option for streaming systems because of its predictable per-operation latencies. HBase is not a good option for hosting the streaming computation framework: technically achievable with coprocessors, but a database makes a poor message queue, and a distributed database makes a poor distributed message queue. To overcome the impedance mismatches, coprocessors for stream processing would need to override much of HBase's default behavior.

Apache Storm (incubating): a distributed stream processing framework. Tuple: individual data item, unit of work. Spouts: tuple sources. Bolts: process tuples. Simple scale-out architecture. Uses Apache ZooKeeper for cluster coordination. Guaranteed message processing semantics. Co-location not recommended due to CPU demands. Must be paired with event sources: the Storm cluster will internally do its own message passing but needs events delivered to its spouts.
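A minimal sketch of a Storm bolt persisting each tuple to HBase from outside the HBase cluster (Storm 1.x package names and the HBase 1.x client, both newer than the incubating release discussed here); the "events" table and tuple field names are hypothetical:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    public class HBasePersistBolt extends BaseRichBolt {
      private transient Connection connection;
      private transient Table table;
      private OutputCollector collector;

      @Override
      public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
          connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
          table = connection.getTable(TableName.valueOf("events"));
        } catch (IOException e) {
          throw new RuntimeException("could not open HBase connection", e);
        }
      }

      @Override
      public void execute(Tuple input) {
        try {
          Put put = new Put(Bytes.toBytes(input.getStringByField("eventId")));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
              Bytes.toBytes(input.getStringByField("payload")));
          table.put(put);        // HBase's predictable per-operation latency is what makes this workable
          collector.ack(input);  // ack only after the write succeeds
        } catch (IOException e) {
          collector.fail(input); // let Storm replay the tuple
        }
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: no downstream streams
      }
    }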

Apache Storm (incubating) Tuples flow over the edges Bolts are arbitrary code - Can create more streams - Can create new tuples - Can update a database - Other side-effects... Image credit: storm-project.net

Apache Samza (incubating): a distributed stream processing framework integrated with an event bus. Uses Apache Kafka for messaging. Uses Apache Hadoop's YARN for fault tolerance, processor isolation, security, and resource management. Also uses Apache ZooKeeper for cluster coordination. Kafka guarantees no message will ever be lost; streams are ordered, replayable, and fault tolerant. Key differences from Storm: co-located with Hadoop; all streams are materialized to disk; messages are never out of order.
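A minimal sketch of a Samza StreamTask (the org.apache.samza.task API); the Kafka output stream name and the string message format are assumptions:

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    public class EnrichTask implements StreamTask {
      private static final SystemStream OUTPUT = new SystemStream("kafka", "events-enriched");

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        // Messages arrive in partition order and are replayable from Kafka on failure.
        String event = (String) envelope.getMessage();   // assumes a string serde
        collector.send(new OutgoingMessageEnvelope(OUTPUT, event.toUpperCase()));
      }
    }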

Application Pattern 6: Real-time computation with Storm or Samza; Kafka as event bus; HBase for persistence; analytic query engine for exploration
  Ingest: Kafka
  Computation: Kafka -> [Storm, Samza] -> [Kafka, HBase]
  Update: User -> [Kafka, HBase]
  Query: HBase -> Query engine -> User

Application Pattern 7: Real-time computation with Storm or Samza; Kafka as event bus; HBase for persistence; batch computation with MapReduce; analytic query engine for exploration
  Ingest: [HDFS, Kafka]
  Materialization: HDFS -> MapReduce -> HBase
  Computation: Kafka -> [Storm, Samza] -> [Kafka, HBase]
  Update: User -> [HBase, Kafka]
  Query: HBase -> Query engine -> User

Application Pattern 8: Real-time computation with Storm or Samza; Kafka as event bus; HBase for persistence; in-memory computation with Spark; Shark for exploration
  Ingest: [HDFS, Kafka]
  Materialization: HDFS -> Spark -> HBase
  Computation: Kafka -> [Storm, Samza] -> [Kafka, HBase]
  Update: User -> [HBase, Kafka]
  Query: HBase -> Shark -> User

Near-real-time indexing: NGDATA hbase-indexer. HBase replication (distinct from HDFS block replication) transfers edits from one HBase cluster to another via background edit-log shipping. Key insight: updates to HBase can be sent to an indexing engine, e.g. Apache Solr, instead of a remote cluster. Index maintenance without coprocessors. Indexing is performed asynchronously, so it does not impact write throughput on HBase.

Near-real-time indexing HBase row updates are mapped to Solr updates and stored in a distributed Solr configuration (SolrCloud) to ensure scalability of the indexing Image credit: NGData
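On the query side, a minimal sketch of interactive exploration against the Solr index with SolrJ (4.x-era API); the Solr URL, collection, query, and field names are hypothetical:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SearchSketch {
      public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/hbase-collection");
        SolrQuery query = new SolrQuery("payload:error").setRows(10);
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
          // The indexed document carries the HBase row key, so the full record
          // can be fetched back from HBase if needed.
          System.out.println(doc.getFieldValue("id"));
        }
        solr.shutdown();
      }
    }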

Application Pattern 9: Near-real-time indexing of HBase mutations; interactive exploration with search tools
  Ingest: [HDFS, HBase, Kafka]
  Materialization: [HDFS, HBase] -> MapReduce -> Solr
  Computation: [HBase, Kafka] -> [Storm, Samza, MapReduce] -> HBase -> Solr
  Update: User -> HBase -> Solr
  Query: Solr -> User

End