Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Agenda
1. Why Apache Hadoop and HBase?
2. Quick Introduction to Apache HBase
3. Applications of HBase at Facebook

Why Hadoop and HBase? For Realtime Data?

Problems with existing stack
- MySQL is stable, but...
  - Not inherently distributed
  - Table size limits
  - Inflexible schema
- Hadoop is scalable, but...
  - MapReduce is slow and difficult
  - Does not support random writes
  - Poor support for random reads

Specialized solutions
- High-throughput, persistent key-value: Tokyo Cabinet
- Large scale data warehousing: Hive/Hadoop
- Photo Store: Haystack
- Custom C++ servers for lots of other stuff

What do we need in a data store?
- Requirements for Facebook Messages
  - Massive datasets, with large subsets of cold data
  - Elasticity and high availability
  - Strong consistency within a datacenter
  - Fault isolation
- Some non-requirements
  - Network partitions within a single datacenter
  - Active-active serving from multiple datacenters

HBase satisfied our requirements
- In early 2010, engineers at FB compared DBs
  - Apache Cassandra, Apache HBase, Sharded MySQL
- Compared performance, scalability, and features
  - HBase gave excellent write performance, good reads
  - HBase already included many nice-to-have features
    - Atomic read-modify-write operations
    - Multiple shards per server
    - Bulk importing

HBase uses HDFS
- We get the benefits of HDFS as a storage system for free
  - Fault tolerance
  - Scalability
  - Checksums fix corruptions
  - MapReduce
  - Fault isolation of disks
  - HDFS battle tested at petabyte scale at Facebook
  - Lots of existing operational experience

Apache HBase
- Originally part of Hadoop
  - HBase adds random read/write access to HDFS
- Required some Hadoop changes for FB usage
  - File appends
  - HA NameNode
  - Read optimizations
- Plus ZooKeeper!

HBase System Overview
[Diagram: three layers]
- Database Layer (HBase): Master, Backup Master, Region Servers...
- Storage Layer (HDFS): Namenode, Secondary Namenode, Datanodes...
- Coordination Service (ZooKeeper Quorum): ZK Peers...

HBase in a nutshell
- Sorted and column-oriented
- High write throughput
- Horizontal scalability
- Automatic failover
- Regions sharded dynamically

Applications of HBase at Facebook

Use Case 1: Titan (Facebook Messages)

The New Facebook Messages
- Messages
- IM/Chat
- email
- SMS

Facebook Messaging
- High write throughput
  - Every message, instant message, SMS, and e-mail
  - Search indexes for all of the above
  - Denormalized schema
- A product at massive scale on day one
  - 6k messages a second
  - 50k instant messages a second
  - 300TB data growth/month compressed
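To put the per-second figures on this slide in daily terms, here is a small back-of-the-envelope calculation. It assumes the quoted rates are sustained around the clock and that a month has 30 days; neither assumption comes from the slide, so treat the results as rough orders of magnitude.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide.
MESSAGES_PER_SECOND = 6_000
INSTANT_MESSAGES_PER_SECOND = 50_000
GROWTH_TB_PER_MONTH = 300  # compressed

# Assumption (not from the slide): a 30-day month.
DAYS_PER_MONTH = 30


def per_day(rate_per_second: int) -> int:
    """Scale a per-second rate to a per-day count, assuming a sustained rate."""
    return rate_per_second * SECONDS_PER_DAY


messages_per_day = per_day(MESSAGES_PER_SECOND)
instant_messages_per_day = per_day(INSTANT_MESSAGES_PER_SECOND)
growth_tb_per_day = GROWTH_TB_PER_MONTH / DAYS_PER_MONTH

print(f"messages/day:         {messages_per_day:,}")
print(f"instant messages/day: {instant_messages_per_day:,}")
print(f"growth (TB/day):      {growth_tb_per_day}")
```

Under those assumptions this works out to roughly half a billion messages and a few billion instant messages per day, with about 10 TB of compressed growth per day.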

Typical Cell Layout
- Multiple cells for messaging
  - 20 servers/rack; 5 or more racks per cluster
  - Controllers (master/zookeeper) spread across racks

[Diagram: five racks. Each rack holds one set of controller roles plus Region Server / Data Node / Task Tracker nodes (marked "19x...").]
- Rack #1: ZooKeeper, HDFS NameNode
- Rack #2: ZooKeeper, Backup NameNode
- Rack #3: ZooKeeper, Job Tracker
- Rack #4: ZooKeeper, HBase Master
- Rack #5: ZooKeeper, Backup Master

High Write Throughput
[Diagram: a write (key, value) goes to two places]
- Commit Log (in HDFS): key/value entries, sequential write
- Memstore (in memory): key/value entries sorted in memory, sequential write
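The write path in this diagram can be sketched as a toy in-memory model. This is only an illustration, not HBase code: the class name, the flush threshold, and the reading of the second "sequential write" label as a flush of the sorted memstore to an immutable file are all assumptions.

```python
import bisect


class ToyRegionStore:
    """Toy model of the slide's write path (hypothetical, not the HBase API)."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []      # stand-in for the append-only commit log in HDFS
        self.memstore = {}        # key -> latest value, held in memory
        self.memstore_keys = []   # the same keys, kept in sorted order
        self.flushed_files = []   # immutable sorted runs written out on flush
        self.flush_threshold = flush_threshold  # assumed knob, for illustration

    def put(self, key, value):
        # 1. Append to the commit log: a sequential write, in arrival order.
        self.commit_log.append((key, value))
        # 2. Insert into the memstore, keeping keys sorted in memory.
        if key not in self.memstore:
            bisect.insort(self.memstore_keys, key)
        self.memstore[key] = value
        # 3. Once the memstore is large enough, write it out in one pass.
        if len(self.memstore_keys) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The memstore is already sorted, so the flush is one sequential write.
        run = [(k, self.memstore[k]) for k in self.memstore_keys]
        self.flushed_files.append(run)
        self.memstore = {}
        self.memstore_keys = []

    def get(self, key):
        # Newest data wins: check the memstore first, then newer runs first.
        if key in self.memstore:
            return self.memstore[key]
        for run in reversed(self.flushed_files):
            for k, v in run:
                if k == key:
                    return v
        return None
```

The point of the design is that neither step needs a random disk write: the log is appended in arrival order, and sorting happens in memory before anything is written out.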

Horizontal Scalability
[Diagram: regions]

Automatic Failover
[Diagram: HBase client and region servers]
- Server died: HBase client finds new server from META
- No physical data copy because data is in HDFS
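A minimal sketch of the failover idea on this slide, assuming a simple round-robin reassignment policy. The class names, the client-side cache, and the reassignment rule are hypothetical; the slide only states that the client finds the new server from META and that no data is copied because it lives in HDFS.

```python
class ToyCluster:
    """Toy cluster: META maps regions to servers; data sits in shared storage."""

    def __init__(self, servers, regions):
        self.live = list(servers)
        # META: region -> server currently hosting it (round-robin to start).
        self.meta = {r: self.live[i % len(self.live)] for i, r in enumerate(regions)}
        # Stand-in for HDFS: region data is not tied to any one server.
        self.hdfs = {r: {} for r in regions}

    def server_died(self, server):
        """Reassign the dead server's regions by updating META only."""
        self.live.remove(server)
        orphaned = sorted(r for r, s in self.meta.items() if s == server)
        for i, region in enumerate(orphaned):
            # Assumed policy: spread orphaned regions round-robin over survivors.
            self.meta[region] = self.live[i % len(self.live)]
        # Note: self.hdfs is untouched. No physical data copy happens.


class ToyClient:
    """Toy client that caches region locations and re-reads META on failure."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.cache = {}

    def locate(self, region):
        server = self.cache.get(region)
        if server is None or server not in self.cluster.live:
            # Cached server is missing or dead: find the new server from META.
            server = self.cluster.meta[region]
            self.cache[region] = server
        return server
```

Because failover is a metadata update rather than a data move, recovery time in this model does not depend on how much data the dead server was serving.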

Use Case 2: Puma (Facebook Insights)

Puma
- Realtime Data Pipeline
  - Utilize existing log aggregation pipeline (Scribe-HDFS)
  - Extend low-latency capabilities of HDFS (Sync+PTail)
  - High-throughput writes (HBase)
- Support for Realtime Aggregation
  - Utilize HBase atomic increments to maintain roll-ups
  - Complex HBase schemas for unique-user calculations

Puma as Realtime MapReduce
- Map phase with PTail
  - Divide the input log stream into N shards
  - First version only supported random bucketing
  - Now supports application-level bucketing
- Reduce phase with HBase
  - Every row+column in HBase is an output key
  - Aggregate key counts using atomic counters
  - Can also maintain per-key lists or other structures
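The map and reduce phases described here can be sketched in a few lines. This is a toy model under stated assumptions: the event fields (`domain`, `action`), the CRC-based shard function, and the lock-protected counter table standing in for HBase atomic increments are all hypothetical and are not Puma's or HBase's actual interfaces.

```python
import threading
import zlib
from collections import Counter


def shard_for(key, num_shards):
    """Application-level bucketing: the same key always lands in the same shard."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def map_phase(events, num_shards, bucket_key):
    """Divide the input log stream into num_shards shards."""
    shards = [[] for _ in range(num_shards)]
    for event in events:
        shards[shard_for(bucket_key(event), num_shards)].append(event)
    return shards


class ToyCounterTable:
    """Stand-in for a table with atomic counters keyed by (row, column)."""

    def __init__(self):
        self._cells = Counter()
        self._lock = threading.Lock()

    def increment(self, row, column, amount=1):
        # The lock makes the read-modify-write atomic within this process.
        with self._lock:
            self._cells[(row, column)] += amount
            return self._cells[(row, column)]

    def get(self, row, column):
        with self._lock:
            return self._cells.get((row, column), 0)


def reduce_phase(shards, table):
    """Every row+column is an output key; aggregate counts with increments."""
    for shard in shards:
        for event in shard:
            table.increment(event["domain"], event["action"])
```

Because the counters are updated with atomic increments, the final counts do not depend on how events were bucketed. Bucketing by an application-level key matters when a shard needs to see all events for that key, for example to batch increments locally before writing.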

Puma for Facebook Insights
- Realtime URL/Domain Insights
  - Domain owners can see deep analytics for their site
  - Clicks, Likes, Shares, Comments, Impressions
  - Detailed demographic breakdowns (anonymized)
  - Top URLs calculated per-domain and globally
- Massive Throughput
  - Billions of URLs
  - > 1 Million counter increments per second

Use Case 3: ODS (Facebook Internal Metrics)

ODS
- Operational Data Store
  - System metrics (CPU, Memory, IO, Network)
  - Application metrics (Web, DB, Caches)
  - Facebook metrics (Usage, Revenue)
  - Easily graph this data over time
  - Supports complex aggregation, transformations, etc.
- Difficult to scale with MySQL
  - Millions of unique time-series with billions of points
  - Irregular data growth patterns

Dynamic sharding of regions
[Diagram: regions; label: server overloaded]
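One way to picture moving regions off an overloaded server is the sketch below. It balances purely by region count, which is an assumption made for illustration; the slide does not say how load is measured or which regions are moved, and the function name is hypothetical.

```python
def rebalance(assignment):
    """Move regions from the busiest server to the idlest one until the
    region counts differ by at most one. Returns a new assignment and
    leaves the input untouched."""
    result = {server: list(regions) for server, regions in assignment.items()}
    while True:
        busiest = max(result, key=lambda s: len(result[s]))
        idlest = min(result, key=lambda s: len(result[s]))
        if len(result[busiest]) - len(result[idlest]) <= 1:
            return result
        # Only the assignment changes; region data stays where it is in HDFS.
        result[idlest].append(result[busiest].pop())


before = {
    "rs1": ["r1", "r2", "r3", "r4", "r5"],  # the overloaded server
    "rs2": ["r6"],
    "rs3": [],
}
after = rebalance(before)
for server, regions in after.items():
    print(server, regions)
```

Since regions are the unit of movement, a server that takes on irregular growth can shed load without any change to the table's schema or to the clients writing to it.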

Future of HBase at Facebook

User and Graph Data in HBase

HBase for the important stuff
- Looking at HBase to augment MySQL
  - Only single row ACID from MySQL is used
  - DBs are always fronted by an in-memory cache
  - HBase is great at storing dictionaries and lists
- Database tier size determined by IOPS
  - HBase does only sequential writes
  - Lower IOPS translate to lower cost
  - Larger tables on denser, cheaper, commodity nodes

Conclusion
- Facebook investing in Realtime Hadoop/HBase
  - Work of a large team of Facebook engineers
  - Close collaboration with open source developers
- Much more detail in Realtime Hadoop paper
  - Technical details about changes to Hadoop and HBase
  - Operational experiences in production

Questions?
jgray@fb.com
dhruba@fb.com