Data Warehouse Overview Namit Jain

Agenda Why data? Life of a tag for data infrastructure Warehouse architecture Challenges Summarizing

Data Science peace.facebook.com Friendships on Facebook

Data Science - facebook.com/data Gross National Happiness

Data Analyses

Data-enhanced Products People You May Know (PYMK) Newsfeed ranking Ads optimization Index building for search

External Reporting Social Plugin Insights

Internal Reporting Product Insights Data-driven product development Allows products to iterate quickly by observing user behavior

Life of a tag for data infrastructure

Facebook Architecture (Simplified)

Facebook Architecture Data Sources. Log data (facts): web-tier user activity logs (view/click of an ad, liking a story, fanning a page, status update, ...) and backend services (Search, Newsfeed, Ads). Facebook-site related data (dimensions): MySQL, e.g. descriptions of ads and user demographics.

Life of a tag for data infrastructure (figure). A user tags a photo on www.facebook.com and a log line <user_id, photo_id> is generated. Via Scribe log storage, the log line reaches Scribeh in ~1s and the warehouse in ~1hr (copier/loader); user info from MySQL reaches the warehouse in ~1day (scrapes). On top of the warehouse: realtime analytics with Puma (count users tagging photos in the last hour, ~1min), adhoc analysis with HiPal (count photos tagged by females age 20-25 yesterday), and periodic analysis with Nocron (daily report on count of photo tags by country, ~1day).

Takeaways Log collection Realtime analysis Batch analysis Periodic analysis Interactive analysis

Takeaways Scribe/Calligraphus Puma/HBase Hive/Hadoop Databee/Chronos

Takeaways Open Source Scribe HBase Hive/Hadoop

Scribe: an open source, simple and scalable log collection system (Web Tier → Mid-Tier → Warehouse).

Challenges: choosing the right stack. Candidates: Hadoop/Hive, Oracle/AsterData, sharded MySQL, compared on cost, availability, scalability, performance, ACID, and ease of use.

Warehouse Architecture

Warehouse Architecture Storage (HDFS)

Warehouse Architecture Compute (MapReduce) Storage (HDFS)

Warehouse Architecture Compute (MapReduce) Storage (HDFS) Hadoop

Warehouse Architecture Query (Hive) Compute (MapReduce) Storage (HDFS) Hadoop

Warehouse Architecture Workflow (Nocron) Query (Hive) Compute (MapReduce) Storage (HDFS) Hadoop

What is Hadoop: an open source Apache project; a framework for running applications on large clusters of commodity hardware; scale: petabytes of data on thousands of nodes. Hadoop layers: storage layer (HDFS) and processing layer (MapReduce). Characteristics: uses clusters of commodity computers; supports moving computation close to data; single storage + compute cluster vs. separate clusters; scalable, fault tolerant, and easily managed, but not as easy to program as databases (SQL).

HDFS Data Model: data is logically organized into files and directories; files are divided into uniform-sized blocks; blocks are distributed across the nodes of the cluster and are replicated to handle hardware failure; HDFS keeps checksums of data for corruption detection and recovery; HDFS exposes block placement so that computation can be migrated to the data.

HDFS Architecture (figure): clients issue metadata ops to the Namenode, which holds the metadata (name, #replicas, ..., e.g. /users/foo/data, 3) and handles block ops; clients read and write blocks directly on Datanodes, and blocks are replicated across racks (Rack 1, Rack 2).

MapReduce Review - WordCount
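As a quick illustration of the pattern (not from the original slides): the classic WordCount, expressed here in HiveQL over a hypothetical table docs with a single STRING column named line. Hive compiles this into the same map (tokenize), shuffle (group by word), reduce (count) shape.

-- Hypothetical table: docs(line STRING), one row per line of text.
-- split() tokenizes the line, explode() emits one row per word (map side);
-- GROUP BY word with count(1) is the shuffle + reduce.
SELECT w.word, count(1) AS cnt
FROM docs
LATERAL VIEW explode(split(line, ' ')) w AS word
GROUP BY w.word;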

Warehouse Hadoop Storage/Compute

Hive: aims to simplify usage of Hadoop. A system for managing and querying structured and semi-structured data built on top of Hadoop: Map-Reduce for execution, HDFS for storage, metadata on HDFS files. Key building principles: SQL is a familiar language; extensibility (types, functions, formats, scripts); performance.

Hive Simplifying usage of Hadoop:

hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.23-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*

Hive Architecture

Hive Data/Query Model Looks and behaves almost like a regular database Data Model Tables with typed columns Flexible types and storage formats Query Model Flavor of SQL for analytics queries Extensible via user defined functions and custom map/reduce scripts

Data Model

Hive Entity        Sample Metastore Entity   Sample HDFS Location
Table              T                         /wh/t
Partition          date=d1                   /wh/t/date=d1
Bucketing column   userid                    /wh/t/date=d1/part-0000 ... /wh/t/date=d1/part-1000 (hashed on userid)
External Table     extt                      /wh2/existing/dir (arbitrary location)

Data Model Tables: analogous to tables in relational DBs; each table has a corresponding directory in HDFS. Example: page views table, name pvs, HDFS directory /wh/pvs.
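A minimal DDL sketch for that example (the columns are illustrative; userid and pageid are the ones used later in these slides):

-- Hive places this table's data under the warehouse directory, e.g. /wh/pvs.
CREATE TABLE pvs (
  userid INT,
  pageid INT
);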

Data Model Partitions Analogous to dense indexes on partition columns Nested sub-directories in HDFS for each combination of partition column values Example Partition columns: ds, ctry HDFS subdirectory for ds = 20090801, ctry = US /wh/pvs/ds=20090801/ctry=us HDFS subdirectory for ds = 20090801, ctry = CA /wh/pvs/ds=20090801/ctry=ca
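A hedged sketch of the same table declared with ds and ctry as partition columns, plus a query whose predicate lets Hive scan only the matching subdirectory:

-- Partitioned variant of pvs: one subdirectory per (ds, ctry) combination.
CREATE TABLE pvs (userid INT, pageid INT)
PARTITIONED BY (ds STRING, ctry STRING);

-- Only /wh/pvs/ds=20090801/ctry=US is read; other partitions are pruned.
SELECT count(1)
FROM pvs
WHERE ds = '20090801' AND ctry = 'US';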

Data Model Buckets Split data based on hash of a column - mainly for parallelism One HDFS file per bucket within partition sub-directory Example Bucket column: user into 32 buckets HDFS file for user hash 0 /wh/pvs/ds=20090801/ctry=us/part-00000 HDFS file for user hash bucket 20 /wh/pvs/ds=20090801/ctry=us/part-00020
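And a sketch adding the bucketing from this slide (the earlier data-model table shows userid as the bucketing column):

-- 32 buckets hashed on userid; each bucket is one file per partition directory,
-- e.g. /wh/pvs/ds=20090801/ctry=us/part-00000 ... part-00031.
CREATE TABLE pvs (userid INT, pageid INT)
PARTITIONED BY (ds STRING, ctry STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Make INSERTs honor the bucket definition (one output file per bucket).
SET hive.enforce.bucketing=true;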

Data Model External Tables: point to existing data directories in HDFS. Can create tables and partitions; partition columns just become annotations to external directories.

Example: create an external table with partitions:

CREATE EXTERNAL TABLE pvs (userid INT, pageid INT)
PARTITIONED BY (ds STRING, ctry STRING)
STORED AS TEXTFILE
LOCATION '/path/to/existing/table';

Example: add a partition to the external table:

ALTER TABLE pvs ADD PARTITION (ds = '20090801', ctry = 'US')
LOCATION '/path/to/existing/partition';

Example Application

Status updates table: status_updates(userid int, status string, ds string)

Load the data from log files:

LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates
PARTITION (ds = '2009-03-20');

User profile table: profiles(userid int, school string, gender int)

Example Query Plan (Filter): filter status updates containing "michael jackson".

SELECT * FROM status_updates
WHERE status LIKE '%michael jackson%';

Example Query Plan (Aggregation): figure out the total number of status_updates in a given day.

SELECT COUNT(1) FROM status_updates
WHERE ds = '2009-08-01';
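A further hedged sketch, not from the slides, combining the two example tables into one analytics query (both schemas are defined above):

-- Count that day's status updates per school by joining with profiles.
SELECT p.school, count(1) AS cnt
FROM status_updates s
JOIN profiles p ON (s.userid = p.userid)
WHERE s.ds = '2009-08-01'
GROUP BY p.school;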

Hive Query Language Extensibility: pluggable map-reduce scripts; pluggable user-defined functions; pluggable user-defined types (complex object types, e.g. list of maps); pluggable data formats (Apache log format, columnar storage format).
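As a sketch of the UDF plug-in path (the jar path, function name, and Java class below are hypothetical):

-- Register a custom Java UDF and use it like a built-in function.
ADD JAR /path/to/my_udfs.jar;                      -- hypothetical jar
CREATE TEMPORARY FUNCTION normalize_url
  AS 'com.example.hive.udf.NormalizeUrl';          -- hypothetical class
SELECT normalize_url(page_url) FROM mylog;         -- mylog is defined a few slides below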

Hive Evolution. Originally: a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs. Now more and more: a parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture. Nearly 100% of Hadoop jobs in the warehouse go through Hive. TRANSFORM scripts (any language; serialization + IPC overhead). Pre/post hooks (Java) around statement validation/execution; example uses: auditing, replication, authorization, multiple clusters.

Hive is an open system. Different on-disk data formats: text file, sequence file, ... Different in-memory data formats: Java Integer/String, Hadoop IntWritable/Text. User-provided map/reduce scripts: in any language, using stdin/stdout to transfer data. User-defined functions: substr, trim, from_unixtime, ... User-defined aggregation functions: sum, average, ...
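For example, the built-in functions named here compose directly in a query over the mylog table defined on the next slide:

-- trim() and from_unixtime() are built-in UDFs; count() is a built-in UDAF.
SELECT trim(page_url)                AS url,
       from_unixtime(min(unix_time)) AS first_seen,
       count(1)                      AS hits
FROM mylog
GROUP BY trim(page_url);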

File Format Example

CREATE TABLE mylog (
  user_id BIGINT,
  page_url STRING,
  unix_time INT)
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/myname/log.txt' INTO TABLE mylog;

Existing File Formats

                               TEXTFILE     SEQUENCEFILE   RCFILE
Data type                      text only    text/binary    text/binary
Internal storage order         row-based    row-based      column-based
Compression                    file-based   block-based    block-based
Splitable*                     YES          YES            YES
Splitable* after compression   NO           YES            YES

* Splitable: capable of splitting the file so that a single huge file can be processed by multiple mappers in parallel.

SerDe Examples

CREATE TABLE mylog (
  user_id BIGINT,
  page_url STRING,
  unix_time INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE mylog_rc (
  user_id BIGINT,
  page_url STRING,
  unix_time INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;

SerDe SerDe is short for serialization/deserialization. It controls the format of a row. Serialized format: Delimited format (tab, comma, ctrl-a ) Thrift Protocols Deserialized (in-memory) format: Java Integer/String/ArrayList/HashMap Hadoop Writable classes User-defined Java Classes (Thrift)

Map/Reduce Scripts Examples

add file page_url_to_id.py;
add file my_python_session_cutter.py;

FROM (
  SELECT TRANSFORM(user_id, page_url, unix_time)
         USING 'page_url_to_id.py'
         AS (user_id, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY user_id
  SORT BY user_id, unix_time) mylog2
SELECT TRANSFORM(user_id, page_id, unix_time)
       USING 'my_python_session_cutter.py'
       AS (user_id, session_info);

Comparison of UDF/UDAF vs. M/R scripts

                    UDF/UDAF             M/R scripts
Language            Java                 any language
Data format         in-memory objects    serialized streams
1/1 input/output    supported via UDF    supported
n/1 input/output    supported via UDAF   supported
1/n input/output    supported via UDTF   supported
Speed               faster               slower

Common Join Task (figure): mappers read Table X and Table Y, a shuffle brings rows with the same join key together, and reducers produce the join output.

Join in Map Reduce (figure): mappers emit rows of page_view (pageid, userid, time) and user (userid, age, gender) as key/value pairs keyed on userid and tagged by source table; after shuffle and sort, the reducer sees all values for a given userid (e.g. 111, 222) together and emits the joined rows.
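In HiveQL the join sketched in this figure is simply (table and column names taken from the figure):

-- Compiled into the plan above: mappers tag rows with the join key (userid),
-- the shuffle groups both tables' rows by userid, reducers emit joined rows.
SELECT pv.pageid, u.age, u.gender
FROM page_view pv
JOIN user u ON (pv.userid = u.userid);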

Auto Map-Join
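The slides only name the optimization; a hedged sketch of how a map-side join is requested in Hive of this era, either via an explicit hint or by letting the optimizer convert joins whose small side fits in memory:

-- Explicit hint: build a hash table of user in each mapper, stream page_view.
SELECT /*+ MAPJOIN(u) */ pv.pageid, u.age
FROM page_view pv
JOIN user u ON (pv.userid = u.userid);

-- Automatic conversion when one join side is small enough.
SET hive.auto.convert.join=true;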


Bucketized Map-Join

Sort Merge Bucket Map-Join
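Again only named on the slides; a hedged sketch of the relevant switches, assuming both join tables are bucketed (and, for the sort-merge variant, also sorted) on the join key:

-- Bucketized map-join: only the matching bucket of the small table is
-- loaded into memory for each mapper.
SET hive.optimize.bucketmapjoin=true;

-- Sort-merge bucket map-join: sorted buckets are merged directly,
-- with no in-memory hash table.
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;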

Hive alone is not enough! Workflow specification, scheduling and execution framework: workflows are DAGs; nodes are data transfers and transformations; edges are dependencies between nodes. Reporting and dashboard tools. Hive query/workflow authoring tools. Warehouse management: track space and CPU usage of the cluster; capacity planning for growth.

Warehouse Challenges

Warehouse Challenges Growth Data, data, and more data

Growth Numbers, March 2008 → March 2012

Facebook users (million)   14X
Queries/day                60X
Scribe data (GB/day)       250X
Nodes                      260X
Size, TB (total)           2500X

HDFS Normal Deployment NameNode Data Node 1 Data Node 2 Data Node 3

First Attempts. Concatenate old tables/partitions (ALTER TABLE ... PARTITION <p> CONCATENATE): no need to compress/uncompress the data for RCFile. Hadoop Archive Files: needed for bucketed files. Upgrade the Namenode.
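A sketch of the concatenate command on the partitioned table from the earlier data-model slides:

-- Merge the small files of one partition into fewer, larger files;
-- for RCFile this happens without decompressing/recompressing the data.
ALTER TABLE pvs PARTITION (ds = '20090801', ctry = 'US') CONCATENATE;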

HDFS Hacked Federation (figure): two NameNodes (NN1, NN2) sharing the same DataNodes (DN1, DN2, DN3).

HDFS - Federated Deployment NameNode1 NameNode2 Data Node 1 Data Node 2 Data Node 3

HDFS Layout NEW: Map Reduce over an HDFS cluster with multiple Name Nodes.

Corona Hive Query (figure): the Hive CLI + Job Client submits to the Job Tracker, which heartbeats with Task Trackers running map (M) and reduce (R) tasks.

Hadoop Corona: split the current Job Tracker into a Cluster Manager that manages resources/nodes and one Corona Job Tracker per job; the Corona Job Tracker requests resources from the Cluster Manager. Only a small amount of state lives in the Cluster Manager, so it can restart.

Corona Hive Query (figure): a Cluster Manager owns the resources; the Hive CLI + Job Client + per-job Job Tracker heartbeats with Task Trackers running map (M) and reduce (R) tasks.

Warehouse Challenges Growth Isolation Space Isolation Compute Isolation Failure Isolation

Isolation - Now: hardware isolation (Platinum cluster & Silver cluster); partial compute isolation via pools (Pool1, Pool2, Pool3) over the Map Reduce cluster and HDFS cluster.

Challenges: Isolation (figure): replication between the Platinum and Silver clusters.

Isolation Pools FIFO within each pool

Minimum slots are configured per team: ADS, BI, COEFFICIENT, GROWTH, SCRAPING, INSIGHTS, NETEGO, PLATFORM.

Isolation - Future Logical namespace per team Namespace encompasses Transport capacity (scribe) Realtime analytics capacity (puma) Storage capacity (hive tables) Compute capacity (periodic/adhoc analyses) Resource accountability per namespace Pools computed dynamically

Isolation - Future (figure, NEW): namespaces NS1, NS2, NS3 mapped to pools Pool1, Pool2, Pool3 over the Map Reduce cluster and HDFS cluster.

Challenges: Testing. Shadow testing with multiple DFS and MR clusters (SILVER, BRONZE, DFS1, DFS5, DFSTEMP).

Testing Snapshot cluster Queries for a day Track cpu/byte for top 100 queries

Warehouse Challenges: Growth, Isolation, Multiple Regions. Hadoop is picky about new capacity requirements; we need to use any capacity in any location and to share data between regions.

Multi Region (figure): Hive1, Map Reduce, Replication, HDFS.

Project Prism (figure): Hive1 over namespaces NS1, NS2, NS3 and pools Pool1, Pool2, Pool3 on the HDFS cluster, with a Central Namespace Server and replication.

Interactive Query - Peregrine

Peregrine Fast Approximate results Memory bound No Join/sub-query support

Open Source. Hadoop: Facebook has its internal branch and releases to GitHub periodically. Hive: development is in Apache; Facebook pulls into its internal branch periodically.

Hive Open Projects Testing Benchmark Data Generator

Hive Open Projects Performance Materialized Views Cost-based optimizer for Hive Index Joins Better skew handling techniques Map-reduce-reduce-reduce* Hash without sort on map-reduce boundary

Questions?