Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, UC Berkeley, Nov 2012

Agenda
1. Types of Data
2. Data Model and API for Facebook Graph Data
3. SLTP (Semi-OLTP) and Analytics Data
4. Why Hive?

Four major types of storage systems
- Online Transaction Processing Databases (OLTP): the Facebook Social Graph
- Semi-online Light Transaction Processing Databases (SLTP): Facebook Messages and Facebook Time Series
- Immutable DataStore: photos, videos, etc.
- Analytics DataStore: Data Warehouse, logs storage

Size and Scale of Databases
- Facebook Graph: single-digit petabytes total; MySQL and TAO; bottleneck: random read IOPS
- Facebook Messages and Time Series Data: tens of petabytes; HBase and HDFS; bottleneck: write IOPS and storage capacity
- Facebook Photos: high tens of petabytes; Haystack; bottleneck: storage capacity
- Data Warehouse: hundreds of petabytes; Hive, HDFS and Hadoop; bottleneck: storage capacity

Characteristics (query latency, consistency, durability)
- Facebook Graph: < few milliseconds; quickly consistent across data centers; no data loss
- Facebook Messages and Time Series Data: < 200 milliseconds; consistent within a data center; no data loss
- Facebook Photos: < 250 milliseconds; immutable; no data loss
- Data Warehouse: < 1 minute; not consistent across data centers; no silent data loss

Facebook Graph: Objects and Associations

Data model Objects & Associations likes 8636146 (user) liked by fan friend 18429207554 (page) admin friend 604191769 (user) name: Barack Obama birthday: 08/04/1961 website: http:// verified: 1 6205972929 (story)

Facebook Social Graph: TAO and MySQL
An OLTP workload:
- Uneven, read-heavy workload
- Huge working set with creation-time locality
- Highly interconnected data
- Constantly evolving
- As consistent as possible

Data model
- Content-aware data store: allows for server-side data processing; can exploit creation-time locality
- Graph data model: nodes and edges are Objects and Associations
- Restricted graph API

Data model Objects & Associations Object -> unique 64 bit ID plus a typed dictionary (id) -> (otype, (key -> value)* ) ID 6815841748 -> { type : page, name : Barack Obama, } Association -> typed directed edge between 2 IDs (id1, atype, id2) -> (time, (key -> value)* ) (8636146, RSVP, 130855887032173) -> (1327719600, { response : YES }) Association lists (id1, atype) -> all assocs with given id1, atype in desc order by time

Data model API: Object : (id) -> (otype, (key -> value)*)
- obj_add(otype, (k->v)*): creates a new object, returns its id
- obj_update(id, (k->v)*): updates some or all fields
- obj_delete(id): removes the object permanently
- obj_get(id): returns the type and fields of the given object, if it exists

Data model API: Association : (id1, atype, id2) -> (time, (key -> value)*)
- assoc_add(id1, atype, id2, time, (k->v)*): adds or updates the given assoc
- assoc_delete(id1, atype, id2): deletes the given association

Data model API: Association : (id1, atype, id2) -> (time, (key -> value)*)
- assoc_get(id1, atype, id2set): returns assocs where id2 ∈ id2set
- assoc_range(id1, atype, offset, limit, filters*): gets relevant matching assocs from the given assoc list
- assoc_count(id1, atype): returns the size of the given assoc list
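
A sketch of how client code might exercise the API listed above, backed by plain dictionaries. The `TaoClient` class and its internals are assumptions for illustration; only the method names and signatures follow the slides.

```python
class TaoClient:
    def __init__(self):
        self._objects, self._assocs, self._next_id = {}, {}, 1

    def obj_add(self, otype, **kv):
        oid = self._next_id; self._next_id += 1
        self._objects[oid] = (otype, dict(kv))          # (otype, (key -> value)*)
        return oid

    def obj_update(self, oid, **kv):
        self._objects[oid][1].update(kv)

    def obj_delete(self, oid):
        self._objects.pop(oid, None)

    def obj_get(self, oid):
        return self._objects.get(oid)

    def assoc_add(self, id1, atype, id2, t, **kv):
        self._assocs.setdefault((id1, atype), {})[id2] = (t, dict(kv))

    def assoc_delete(self, id1, atype, id2):
        self._assocs.get((id1, atype), {}).pop(id2, None)

    def assoc_get(self, id1, atype, id2set):
        alist = self._assocs.get((id1, atype), {})
        return {id2: alist[id2] for id2 in id2set if id2 in alist}

    def assoc_count(self, id1, atype):
        return len(self._assocs.get((id1, atype), {}))

tao = TaoClient()
user = tao.obj_add("user", name="Alice")
page = tao.obj_add("page", name="Some Page")
tao.assoc_add(user, "likes", page, 1327719600)
print(tao.assoc_count(user, "likes"))        # -> 1
print(tao.assoc_get(user, "likes", {page}))
```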

Architecture Cache & Storage Web servers TAO Storage Cache MySQL Storage

Architecture: Sharding
Object ids and Assoc id1s are mapped to shard ids.
Diagram: web servers -> TAO cache servers, each holding a set of shards (s1 ... s8) -> MySQL storage, with each shard backed by a database (db1 ... db8).
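
A minimal sketch of such a mapping: an id hashes to a shard, and each shard is assigned to a MySQL database, so every association keyed by the same id1 lives on one shard and can be read from a single database. The modulo scheme, shard count, and db assignment below are assumptions for illustration, not Facebook's actual placement logic.

```python
NUM_SHARDS = 8
SHARD_TO_DB = {s: f"db{s + 1}" for s in range(NUM_SHARDS)}   # hypothetical shard -> db assignment

def shard_for(object_id: int) -> int:
    return object_id % NUM_SHARDS

def db_for(object_id: int) -> str:
    return SHARD_TO_DB[shard_for(object_id)]

# All associations with the same id1 map to the same shard, so an association
# list can be served from one database.
print(db_for(8636146), db_for(604191769))
```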

Workload
- Read-heavy workload
- Significant range queries
- LinkBench benchmark being open-sourced: http://www.github.com/facebook/linkbench
- Real data distribution of Assocs and their access patterns

Messages & Time Series Database SLTP workload

Facebook Messages: messages, chats, emails, SMS

Why we chose HBase
- High write throughput
- Horizontal scalability
- Automatic failover
- Strong consistency within a data center
- Benefits of HDFS: fault tolerant, scalable, Map-Reduce toolset, ...
Why is this SLTP?
- Semi-online: queries run even if part of the database is offline
- Light transactions: single-row transactions
- Storage-capacity bound rather than IOPS or CPU bound

What we store in HBase
- Small messages
- Message metadata (thread/message indices)
- Search index
Large attachments are stored in Haystack (the photo store).

Size and scale of the Messages database
- 6 billion messages/day
- 74 billion operations/day; at peak, 1.5 million operations/sec
- 55% read, 45% write operations
- Average write operation inserts 16 records
- All data is LZO compressed
- Growing at 8 TB/day
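
A back-of-the-envelope check of how these figures relate; the arithmetic below is an assumption-free restatement of the numbers above, not additional data from the talk.

```python
ops_per_day = 74e9
seconds_per_day = 86400
print(f"{ops_per_day / seconds_per_day:,.0f} ops/sec on average")   # ~856,000, vs 1.5M at peak

write_ops_per_day = 0.45 * ops_per_day          # 45% of operations are writes
records_per_day = write_ops_per_day * 16        # each write inserts 16 records on average
print(f"{records_per_day:.2e} records inserted per day")
```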

Haystack: The Photo Store

Facebook Photo DataStore (2009 vs 2012)
- Total size: 15 billion photos, 1.5 petabytes (2009) -> high tens of petabytes (2012)
- Upload rate: 30 million photos/day, 3 TB/day (2009) -> 300 million photos/day, 30 TB/day (2012)
- Serving rate: 555K images/sec

Haystack-based design (diagram): Browser, CDN, Web Server, Haystack Directory, Haystack Cache, Haystack Store

Haystack internals
- Log-structured, append-only object store
- Built on commodity hardware
- Application-aware replication
- Images stored in 100 GB xfs files (needle files)
- An in-memory index for each needle file; 32 bytes of index per photo
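
A rough calculation of what the 32-bytes-per-photo in-memory index costs per 100 GB store file. The average photo size (100 KB) is an assumption for illustration, not a figure from the talk; the point is that the whole index fits in RAM, so a read needs at most one disk seek.

```python
store_file_bytes = 100 * 1024**3        # one 100 GB store file
avg_photo_bytes = 100 * 1024            # assumed average photo size
index_bytes_per_photo = 32

photos_per_file = store_file_bytes // avg_photo_bytes
index_bytes = photos_per_file * index_bytes_per_photo
print(photos_per_file, "photos per file,",
      index_bytes / 1024**2, "MiB of in-memory index")
# ~1 million photos per 100 GB file and ~32 MiB of index.
```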

Hive Analytics Warehouse

Life of a photo tag in Hadoop/Hive storage (diagram)
- User tags a photo on www.facebook.com; a log line <user_id, photo_id> is generated
- Log line reaches ScribeH, the Scribe log storage on HDFS, in ~10 seconds
- A copier/loader moves the log line into the Hive warehouse in ~15 minutes
- Scrapes of the MySQL DB bring user info into the warehouse in ~1 day
- Periodic analysis (Hive, via Nocron): daily report on the count of photo tags by country (~1 day)
- Ad hoc analysis (Hive, via HiPal): count photos tagged by females aged 20-25 yesterday
- Realtime analytics (HBase, via Puma): count users tagging photos in the last hour (~1 minute)

Analytics data growth (last 4 years)
- Facebook users: 14X
- Queries/day: 60X
- Scribe data/day: 250X
- Nodes in warehouse: 260X
- Size (total): 2500X

Why use Hive instead of a parallel DBMS?
Stonebraker/DeWitt from the DBMS community:
- Quote: "a major step backwards"
- Published benchmark results showing that Hive is not as performant as a traditional DBMS: http://database.cs.brown.edu/projects/mapreduce-vs-dbms/

What is BigData? Prospecting for gold...
- Finding gold in the wild west: a platform for huge data experiments
- A majority of queries are searching for a single gold nugget
- Great advantage in keeping all data in one queryable system
- No structure to data; specify structure at query time

How to measure performance
Traditional database systems: latency of queries
Big Data systems:
- How much data can we store and query? (the Big in BigData)
- How much data can we query in parallel?
- What is the value of this system?

Measure cost of storage
- Distributed network encoding of data: encoding is better than replication
- Use algorithms that minimize network transfer for data repair; trade off CPU for storage and network
- Remember the lineage of data, e.g. record the query that created it
- If data is not accessed for some time, delete it; if a query later needs it, recompute the data from its query lineage (see the sketch after this list)
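
A minimal sketch of the lineage idea: remember the query that produced a derived dataset, drop the data once it goes cold, and recompute it from its lineage on the next access. The class, TTL policy, and producer function below are illustrative assumptions, not Facebook's warehouse tooling.

```python
import time

class Dataset:
    def __init__(self, producer, ttl_seconds):
        self.producer = producer        # lineage: the query/function that builds the data
        self.ttl = ttl_seconds
        self.data = None
        self.last_access = None

    def read(self):
        if self.data is None:           # deleted (or never materialized): recompute from lineage
            self.data = self.producer()
        self.last_access = time.time()
        return self.data

    def evict_if_cold(self):
        if self.last_access and time.time() - self.last_access > self.ttl:
            self.data = None            # free the storage; the lineage is kept

daily_counts = Dataset(lambda: {"US": 123, "IN": 456}, ttl_seconds=7 * 86400)
print(daily_counts.read())
```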

Measure network encoding
- Start the same: triplicate every data block
- Background encoding: combine the third replicas of blocks from a single file into a parity block, then remove the third replicas (diagram: a file with three blocks A, B and C becomes A, B, C plus the parity block A+B+C; XOR encoding)
- Reed-Solomon encoding for much older files
http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
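
A minimal sketch of the XOR step described above: the third replicas of blocks A, B, C are replaced by one parity block A^B^C, and any single missing block can be rebuilt from the other two plus the parity. This is illustrative, not the HDFS RAID implementation.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    # Blocks are assumed to be the same length, as HDFS blocks of one file would be.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

A, B, C = b"aaaa", b"bbbb", b"cccc"
parity = xor_blocks(A, B, C)          # stored instead of the three third replicas

# Block B is lost: reconstruct it from A, C and the parity block.
recovered_B = xor_blocks(A, C, parity)
assert recovered_B == B
```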

Measuring data discovery: crowdsourcing
- There are 50K tables in a single warehouse
- Users are the data administrators themselves
- Questions about a table are directed to users of that table
- Automatic query lineage tools

Measuring testability
Traditional systems: recreate load using tests, validate results
Big Data systems:
- Cannot replicate production load in a test environment
- Deploy the new service on a small percentage of the production service and monitor metrics
- Rolling upgrades: gradually deploy to a larger section of the service

Fault tolerance and elasticity
- Commodity machines: faults are the norm
- Anomalous behavior rather than complete failures: 10% of machines are always 50% slower than the others

Measuring fault tolerance and elasticity
- Fault tolerance is a must: continuously kill machines during benchmarking; slow down 10% of the machines during the benchmark
- Elasticity is necessary: add/remove machines during benchmarking

Measuring the value of the system
- Cost/GB is decreasing with time, so users can store more data
- But users need a metric to determine whether this cost is worth it
- What is the VALUE of this system? A metric that aids the user (and not the service provider)

Value per Byte (VB) for the system
- A new metric named VB: compare differences in value over time; if VB increases with time, then the user is satisfied
- When you touch a byte, its VB is set to MAX (say 100)
- System VB = weighted sum of the VB of each byte in the system

VB: even a turtle ages with time
- The VB decreases with time: a more recent access has more value than an older one
- Different ageing models (linear, exponential)
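
A minimal sketch of the VB metric as described: a byte's VB is reset to a maximum on access and decays with age, and the system VB is a weighted sum. The MAX value, decay horizon, and half-life below are assumed constants for illustration.

```python
import math

MAX_VB = 100.0

def vb_linear(age_days, horizon_days=365):
    # Linear ageing: value falls to zero after the horizon.
    return max(0.0, MAX_VB * (1 - age_days / horizon_days))

def vb_exponential(age_days, half_life_days=90):
    # Exponential ageing: value halves every half_life_days.
    return MAX_VB * math.exp(-math.log(2) * age_days / half_life_days)

def system_vb(byte_ages_and_weights, model=vb_exponential):
    total_w = sum(w for _, w in byte_ages_and_weights)
    return sum(model(age) * w for age, w in byte_ages_and_weights) / total_w

# Bytes touched recently keep the system VB up; bytes untouched for a year drag it down.
print(system_vb([(0, 1.0), (30, 1.0), (365, 2.0)]))
```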

Why use Hive instead of a parallel DBMS?
Stonebraker/DeWitt from the DBMS community:
- Quote: "Hadoop is a major step backwards"
- Published benchmark results showing that Hadoop/Hive is not as performant as a traditional DBMS: http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
- A Hive query is 50 times slower than the equivalent DBMS query
- Stonebraker's conclusion: Facebook's 4000-node cluster (100 PB) could be replaced by a 20-node DBMS cluster
What is wrong with the above conclusion?

Hive/Hadoop instead of a parallel DBMS
- Dr. Stonebraker's proposal would put 5 PB per node on the DBMS
- What would the IO throughput of that system be? Abysmal (see the back-of-the-envelope sketch below)
- How many concurrent queries could it support? Certainly not 100K concurrent clients
- He is using the wrong metric to draw a conclusion: yes, Hive/Hadoop is very, very slow and needs to be fixed to reduce query latency, but an existing DBMS cannot replace Hive/Hadoop
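
A back-of-the-envelope sketch supporting this argument: packing 100 PB onto 20 nodes puts 5 PB on each, and at any plausible per-node scan bandwidth a full scan of one node's data takes weeks. The per-node bandwidth (roughly a dozen disks at ~100 MB/s each) is an assumption for illustration, not a figure from the talk.

```python
node_bytes = 5 * 10**15                  # 5 PB per node (100 PB / 20 nodes)
scan_bandwidth = 12 * 100 * 10**6        # assumed ~1.2 GB/s sequential read per node
seconds = node_bytes / scan_bandwidth
print(f"{seconds / 86400:.0f} days to scan one node's data")   # roughly 48 days
```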

Future Challenges

New trends in storage software
Trends:
- SSDs getting cheaper; increasing number of CPUs per server
- SATA disk capacities reaching 4-8 TB per disk; falling $/GB prices
New projects:
- Evaluate OLTP databases that scale linearly with the number of CPUs
- Prototype storing cold photos on spin-down disks

Questions? dhruba@gmail.com http://hadoopblog.blogspot.com/