Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop



Similar documents
Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data Warehouse Overview. Namit Jain

Hive User Group Meeting August 2009

Hadoop and Hive Development at Facebook. Dhruba Borthakur Zheng Shao {dhruba, Presented at Hadoop World, New York October 2, 2009

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

America s Most Wanted a metric to detect persistently faulty machines in Hadoop

Apache Hadoop FileSystem and its Usage in Facebook

Hadoop Architecture and its Usage at Facebook

HIVE. Data Warehousing & Analytics on Hadoop. Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team

Hive Development. (~15 minutes) Yongqiang He Software Engineer. Facebook Data Infrastructure Team

Hadoop & its Usage at Facebook

Hadoop and Map-Reduce. Swati Gore

Hadoop & its Usage at Facebook

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Cloudera Certified Developer for Apache Hadoop

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Open source Google-style large scale data analysis with Hadoop

COURSE CONTENT Big Data and Hadoop Training

Using distributed technologies to analyze Big Data

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Complete Java Classes Hadoop Syllabus Contact No:

Mr. Apichon Witayangkurn Department of Civil Engineering The University of Tokyo

Introduction to Apache Hive

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, XLDB Conference at Stanford University, Sept 2012

Best Practices for Hadoop Data Analysis with Tableau

Big Data With Hadoop

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Peers Techno log ies Pv t. L td. HADOOP

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, UC Berkeley, Nov 2012

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Open source large scale distributed data management with Google s MapReduce and Bigtable

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Hive A Petabyte Scale Data Warehouse Using Hadoop

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Hadoop IST 734 SS CHUNG

Constructing a Data Lake: Hadoop and Oracle Database United!

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Alternatives to HIVE SQL in Hadoop File Structure

CSE-E5430 Scalable Cloud Computing Lecture 2

Native Connectivity to Big Data Sources in MSTR 10

Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010

The Hadoop Eco System Shanghai Data Science Meetup

Introduction to Apache Hive

Big Data Course Highlights

Introduction to Hadoop

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

Workshop on Hadoop with Big Data

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

CS54100: Database Systems

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Design and Evolution of the Apache Hadoop File System(HDFS)

Hadoop: The Definitive Guide

Can the Elephants Handle the NoSQL Onslaught?

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

HDFS. Hadoop Distributed File System

Hadoop Job Oriented Training Agenda

Internals of Hadoop Application Framework and Distributed File System

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

HADOOP PERFORMANCE TUNING

CS 378 Big Data Programming

NoSQL for SQL Professionals William McKnight

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

APACHE HADOOP JERRIN JOSEPH CSU ID#

Implement Hadoop jobs to extract business value from large and varied data sets

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Cost-Effective Business Intelligence with Red Hat and Open Source

How To Scale Out Of A Nosql Database

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

MapReduce. Tushar B. Kute,

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Big Data and Scripting map/reduce in Hadoop

CS 378 Big Data Programming. Lecture 2 Map- Reduce

Testing Big data is one of the biggest

Hadoop implementation of MapReduce computational model. Ján Vaňo

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Zynga Analytics Leveraging Big Data to Make Games More Fun and Social

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

A Performance Analysis of Distributed Indexing using Terrier

Networking in the Hadoop Cluster

CitusDB Architecture for Real-Time Big Data

Big Data on Microsoft Platform

Architectures for Big Data Analytics A database perspective

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Transcription:

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

Why Another Data Warehousing System? Data, data and more data 200GB per day in March 2008 12+TB(compressed) raw data per day today

Trends Leading to More Data

Trends Leading to More Data Free or low cost of user services

Trends Leading to More Data Free or low cost of user services Realization that more insights are derived from simple algorithms on more data

Deficiencies of Existing Technologies

Deficiencies of Existing Technologies Cost of Analysis and Storage on proprietary systems does not support trends towards more data

Deficiencies of Existing Technologies Cost of Analysis and Storage on proprietary systems does not support trends towards more data Limited Scalability does not support trends towards more data

Deficiencies of Existing Technologies Cost of Analysis and Storage on proprietary systems does not support trends towards more data Limited Scalability does not support trends towards more data Closed and Proprietary Systems

Lets try Hadoop Pros Superior in availability/scalability/manageability Efficiency not that great, but throw more hardware Partial Availability/resilience/scale more important than ACID Cons: Programmability and Metadata Map-reduce hard to program (users know sql/bash/python) Need to publish data in well known schemas Solution: HIVE

What is HIVE? A system for managing and querying structured data built on top of Hadoop Map-Reduce for execution HDFS for storage Metadata in an RDBMS Key Building Principles: SQL as a familiar data warehousing tool Extensibility Types, Functions, Formats, Scripts Scalability and Performance Interoperability

Why SQL on Hadoop? hive> select key, count(1) from kv1 where key > 100 group by key; vs. $ cat > /tmp/reducer.sh uniq -c awk '{print $2"\t"$1} $ cat > /tmp/map.sh awk -F '\001' '{if($1 > 100) print $1} $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 - mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/ largekey -numreducetasks 1 $ bin/hadoop dfs cat /tmp/largekey/part*

Data Flow Architecture at Facebook

Data Flow Architecture at Facebook Web Servers

Data Flow Architecture at Facebook Web Servers Scribe MidTier

Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Scribe-Hadoop Cluster

Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Scribe-Hadoop Cluster Federated MySQL

Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Scribe-Hadoop Cluster Production Hive-Hadoop Cluster Federated MySQL

Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Scribe-Hadoop Cluster Oracle RAC Production Hive-Hadoop Cluster Federated MySQL

Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Hive replication Scribe-Hadoop Cluster Oracle RAC Production Hive-Hadoop Cluster Federated MySQL

Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Hive replication Adhoc Hive-Hadoop Cluster Scribe-Hadoop Cluster Oracle RAC Production Hive-Hadoop Cluster Federated MySQL

Scribe & Hadoop Clusters @ Facebook Used to log data from web servers Clusters collocated with the web servers Network is the biggest bottleneck Typical cluster has about 50 nodes. Stats: ~ 25TB/day of raw data logged 99% of the time data is available within 20 seconds

Hadoop & Hive Cluster @ Facebook Hadoop/Hive cluster 8400 cores Raw Storage capacity ~ 12.5PB 8 cores + 12 TB per node 32 GB RAM per node Two level network topology 1 Gbit/sec from node to rack switch 4 Gbit/sec to top level rack switch 2 clusters One for adhoc users One for strict SLA jobs

Hive & Hadoop Usage @ Facebook Statistics per day: 12 TB of compressed new data added per day 135TB of compressed data scanned per day 7500+ Hive jobs per day 80K compute hours per day Hive simplifies Hadoop: New engineers go though a Hive training session ~200 people/month run jobs on Hadoop/Hive Analysts (non-engineers) use Hadoop through Hive Most of jobs are Hive Jobs

Hive & Hadoop Usage @ Facebook Types of Applications: Reporting Eg: Daily/Weekly aggregations of impression/click counts Measures of user engagement Microstrategy reports Ad hoc Analysis Eg: how many group admins broken down by state/country Machine Learning (Assembling training data) Ad Optimization Eg: User Engagement as a function of user attributes Many others

More about HIVE

Data Model Name HDFS Directory Table pvs /wh/pvs Partition ds = 20090801, ctry = US /wh/pvs/ds=20090801/ctry=us Bucket user into 32 buckets HDFS file for user hash 0 /wh/pvs/ds=20090801/ctry=us/ part-00000

Hive Query Language SQL Sub-queries in from clause Equi-joins (including Outer joins) Multi-table Insert Multi-group-by Embedding Custom Map/Reduce in SQL Sampling Primitive Types integer types, float, string, boolean Nestable Collections array<any-type> and map<primitive-type, any-type> User-defined types Structures with attributes which can be of any-type

Optimizations Joins try to reduce the number of map/reduce jobs needed. Memory efficient joins by streaming largest tables. Map Joins User specified small tables stored in hash tables on the mapper No reducer needed Map side partial aggregations Hash-based aggregates Serialized key/values in hash tables 90% speed improvement on Query SELECT count(1) FROM t; Load balancing for data skew

Hive: Open & Extensible Different on-disk storage(file) formats Text File, Sequence File, Different serialization formats and data types LazySimpleSerDe, ThriftSerDe User-provided map/reduce scripts In any language, use stdin/stdout to transfer data User-defined Functions Substr, Trim, From_unixtime User-defined Aggregation Functions Sum, Average User-define Table Functions Explode

Existing File Formats TEXTFILE SEQUENCEFILE RCFILE Data type text only text/binary text/binary Internal Storage order Row-based Row-based Column-based Compression File-based Block-based Block-based Splitable* YES YES YES Splitable* after compression NO YES YES * Splitable: Capable of splitting the file so that a single huge file can be processed by multiple mappers in parallel.

Map/Reduce Scripts Examples add file page_url_to_id.py; add file my_python_session_cutter.py; FROM (MAP uhash, page_url, unix_time USING 'page_url_to_id.py' AS (uhash, page_id, unix_time) FROM mylog DISTRIBUTE BY uhash SORT BY uhash, unix_time) mylog2 REDUCE uhash, page_id, unix_time USING 'my_python_session_cutter.py' AS (uhash, session_info);

UDF Example add jar build/ql/test/test-udfs.jar; CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.udftestlength'; SELECT testlength(page_url) FROM mylog; DROP TEMPORARY FUNCTION testlength; UDFTestLength.java: package org.apache.hadoop.hive.ql.udf; public class UDFTestLength extends UDF { public Integer evaluate(string s) { } } if (s == null) { } return null; return s.length();

Comparison of UDF/UDAF/UDTF v.s. M/R scripts UDF/UDAF/UDTF M/R scripts language Java any language data format in-memory objects serialized streams 1/1 input/output supported via UDF supported n/1 input/output supported via UDAF supported 1/n input/output supported via UDTF supported Speed faster slower

Interoperability: Interfaces JDBC Enables integration with JDBC based SQL clients ODBC Enables integration with Microstrategy Thrift Enables writing cross language clients Main form of integration with php based Web UI

Interoperability: Microstrategy Beta integration with version 8 Free form SQL support Periodically pre-compute the cube

Operational Aspects on Adhoc cluster Data Discovery cohive Discover tables Talk to expert users of a table Browse table lineage Monitoring Resource utilization by individual, project, group SLA monitoring etc. Bad user reports etc.

HiPal & CoHive (Not open source)

Open Source Community Released Hive-0.4 on 10/13/2009 50 contributors and growing 11 committers 3 external to Facebook Available as a sub project in Hadoop - http://wiki.apache.org/hadoop/hive (wiki) - http://hadoop.apache.org/hive (home page) - http://svn.apache.org/repos/asf/hadoop/hive (SVN repo) - ##hive (IRC) - Works with hadoop-0.17, 0.18, 0.19, 0.20 Mailing Lists: hive-{user,dev,commits}@hadoop.apache.org

Powered by Hive