Data Warehouse Overview Namit Jain
Agenda Why data? Life of a tag for data infrastructure Warehouse architecture Challenges Summarizing
Data Science peace.facebook.com Friendships on Facebook
Data Science - facebook.com/data Gross National Happiness
Data Analyses
Data-enhanced Products People You May Know (PYMK) Newsfeed ranking Ads optimization Index building for search
External Reporting Social Plugin Insights
Internal Reporting Product Insights Data-driven product development Allows products to iterate quickly by observing user behavior
Life of a tag for data infrastructure
Facebook Architecture (Simplified)
Facebook Architecture Data Sources
Log Data (facts): web-tier user activity logs (view/click of an ad, liking a story, fanning a page, status update, ...); backend services - Search, Newsfeed, Ads
Facebook-site related Data (dimensions): MySQL - descriptions of ads, user demographics
Life of a tag for data infrastructure
User tags a photo on www.facebook.com; log line generated: <user_id, photo_id>
Log line reaches Scribeh (1s) - Scribe log storage
Log line reaches warehouse (1hr) - copier/loader
User info reaches warehouse (1day) - MySQL scrapes
Realtime Analytics (1min) - puma: count users tagging photos in the last hour
Periodic Analysis (1day) - nocron: daily report on count of photo tags by country
Adhoc Analysis - hipal: count photos tagged by females age 20-25 yesterday
Takeaways Log collection Realtime analysis Batch analysis Periodic analysis Interactive analysis
Takeaways Scribe/Calligraphus Puma/HBase Hive/Hadoop Databee/Chronos
Takeaways Open Source Scribe HBase Hive/Hadoop
Scribe Open Source, simple and scalable log collection system Web Tier Mid- Tier Warehouse
Challenges: Choosing the right stack? Hadoop/ Hive Oracle/ AsterData Sharded MySQL Cost Availability Scalability Performance ACID Ease of Use
Warehouse Architecture
Warehouse Architecture Storage (HDFS)
Warehouse Architecture Compute (MapReduce) Storage (HDFS)
Warehouse Architecture Compute (MapReduce) Storage (HDFS) Hadoop
Warehouse Architecture Query (Hive) Compute (MapReduce) Storage (HDFS) Hadoop
Warehouse Architecture Workflow (Nocron) Query (Hive) Compute (MapReduce) Storage (HDFS) Hadoop
What is Hadoop: Open Source Apache project; framework for running applications on large clusters of commodity hardware; scale: petabytes of data on thousands of nodes
Hadoop layers: storage layer (HDFS); processing layer (MapReduce)
Characteristics: uses clusters of commodity computers; supports moving computation close to the data; single storage + compute cluster vs. separate clusters; scalable, fault tolerant, and easily managed; but not as easy to program as databases (SQL)
HDFS Data Model Data is logically organized into files and directories Files are divided into uniform-sized blocks Blocks are distributed across the nodes of the cluster and are replicated to handle hardware failure HDFS keeps checksums of data for corruption detection and recovery HDFS exposes block placement so that computation can be migrated to data
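The block/checksum model above can be sketched in a few lines of Python. This is an illustration only, not HDFS's actual code: real blocks are 64-128 MB (not 8 bytes), and HDFS uses CRC32 checksums per 512-byte chunk.

```python
import zlib

BLOCK_SIZE = 8  # tiny, for illustration; real HDFS blocks are 64-128 MB

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide a byte string into uniform-sized blocks, each paired
    with a CRC32 checksum for corruption detection."""
    blocks = []
    for off in range(0, len(data), block_size):
        chunk = data[off:off + block_size]
        blocks.append((chunk, zlib.crc32(chunk)))
    return blocks

def verify(blocks):
    """Recompute checksums; a mismatch signals a corrupted replica,
    which HDFS would then re-replicate from a healthy copy."""
    return all(zlib.crc32(chunk) == crc for chunk, crc in blocks)

blocks = split_into_blocks(b"hello hdfs block model")
assert verify(blocks)
# simulate corruption of the first block's data
corrupted = [(b"XXXXXXXX", blocks[0][1])] + blocks[1:]
assert not verify(corrupted)
```

In the real system each block (with its checksums) is replicated across datanodes, which is what makes both the failure handling and the "move computation to the data" scheduling possible.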
HDFS Architecture
[diagram] Clients send metadata ops to a single Namenode, which holds the metadata (name, #replicas, block list; e.g. /users/foo/data, 3, ...) and issues block ops; Datanodes spread across racks (Rack 1, Rack 2) store the replicated blocks; clients read and write block data directly against the Datanodes, with replication running between them.
MapReduce Review - WordCount
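A minimal in-process sketch of WordCount in Python, showing the map, shuffle/group, and reduce phases that the framework would normally run in parallel across the cluster:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # map phase: emit a (word, 1) pair for every word in the line
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # reduce phase: sum all counts emitted for one word
    return (word, sum(counts))

def word_count(lines):
    # shuffle phase: sort and group intermediate pairs by key,
    # as the MapReduce framework does between map and reduce
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(word, (c for _, c in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

counts = word_count(["the quick fox", "the lazy dog", "the fox"])
# counts == {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In real Hadoop the mapper and reducer run as separate tasks and the sort/group happens in the shuffle; the program structure is the same.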
Warehouse Hadoop Storage/Compute
Hive Aim to simplify usage of Hadoop A system for managing and querying structured and semistructured data built on top of Hadoop Map-Reduce for execution HDFS for storage Metadata on HDFS files Key Building Principles SQL is a familiar language Extensibility Types, Functions, Formats, Scripts Performance
Hive Simplifying usage of Hadoop
hive> SELECT key, count(1) FROM kv1 WHERE key > 100 GROUP BY key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.23-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
Hive Architecture
Hive Data/Query Model Looks and behaves almost like a regular database Data Model Tables with typed columns Flexible types and storage formats Query Model Flavor of SQL for analytics queries Extensible via user defined functions and custom map/reduce scripts
Data Model
Hive Entity        Sample Metastore Entity   Sample HDFS Location
Table              T                         /wh/t
Partition          date=d1                   /wh/t/date=d1
Bucketing column   userid                    /wh/t/date=d1/part-0000 ... /wh/t/date=d1/part-1000 (hashed on userid)
External Table     extt                      /wh2/existing/dir (arbitrary location)
Data Model Tables Analogous to tables in relational DBs Each table has corresponding directory in HDFS Example Page views table name: pvs HDFS directory /wh/pvs
Data Model Partitions Analogous to dense indexes on partition columns Nested sub-directories in HDFS for each combination of partition column values Example Partition columns: ds, ctry HDFS subdirectory for ds = 20090801, ctry = US /wh/pvs/ds=20090801/ctry=us HDFS subdirectory for ds = 20090801, ctry = CA /wh/pvs/ds=20090801/ctry=ca
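The mapping from partition column values to nested HDFS subdirectories can be sketched as follows (a hypothetical helper for illustration, not Hive's code):

```python
def partition_path(table_dir, partition_spec):
    """Build the HDFS subdirectory for a partition: one nested
    directory level per (partition column, value) pair, in order."""
    subdirs = "/".join(f"{col}={val}" for col, val in partition_spec)
    return f"{table_dir}/{subdirs}"

path = partition_path("/wh/pvs", [("ds", "20090801"), ("ctry", "us")])
# path == "/wh/pvs/ds=20090801/ctry=us"
```

Because the partition values are encoded in the path, a query filtering on ds and ctry only needs to list and scan the matching subdirectories, which is the "dense index" effect mentioned above.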
Data Model Buckets Split data based on hash of a column - mainly for parallelism One HDFS file per bucket within partition sub-directory Example Bucket column: user into 32 buckets HDFS file for user hash 0 /wh/pvs/ds=20090801/ctry=us/part-00000 HDFS file for user hash bucket 20 /wh/pvs/ds=20090801/ctry=us/part-00020
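A sketch of how a row is routed to its bucket file. For integer columns Hive's bucket hash is effectively the value itself (Java's Integer.hashCode); the helper below assumes that simple case and is illustrative only:

```python
def hash_value(x):
    # for integers, the bucket hash is effectively the value itself
    # (Java Integer.hashCode); other column types hash differently
    return x

def bucket_file(partition_dir, userid, num_buckets=32):
    """Bucket index = hash(bucketing column) mod number of buckets;
    each bucket is one file inside the partition subdirectory."""
    bucket = hash_value(userid) % num_buckets
    return f"{partition_dir}/part-{bucket:05d}"

assert bucket_file("/wh/pvs/ds=20090801/ctry=us", 0) == \
    "/wh/pvs/ds=20090801/ctry=us/part-00000"
assert bucket_file("/wh/pvs/ds=20090801/ctry=us", 52) == \
    "/wh/pvs/ds=20090801/ctry=us/part-00020"
```

Since all rows with the same hash land in the same numbered file in every table bucketed the same way, buckets enable both parallelism and the bucketized map-joins shown later.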
Data Model External Tables
Point to existing data directories in HDFS
Can create tables and partitions - partition columns just become annotations to external directories
Example: create external table with partitions
CREATE EXTERNAL TABLE pvs (userid int, pageid int)
PARTITIONED BY (ds string, ctry string)
STORED AS textfile
LOCATION '/path/to/existing/table';
Example: add a partition to external table
ALTER TABLE pvs ADD PARTITION (ds='20090801', ctry='US')
LOCATION '/path/to/existing/partition';
Example Application
Status updates table: status_updates(userid int, status string, ds string)
Load the data from log files:
LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates PARTITION (ds='2009-03-20')
User profile table: profiles(userid int, school string, gender int)
Example Query Plan (Filter)
Filter status updates containing "michael jackson"
SELECT * FROM status_updates WHERE status LIKE '%michael jackson%'
Example Query Plan (Aggregation)
Figure out the total number of status_updates in a given day
SELECT COUNT(1) FROM status_updates WHERE ds = '2009-08-01'
Hive Query Language Extensibility Pluggable Map-reduce scripts Pluggable User Defined Functions Pluggable User Defined Types Complex object types: List of Maps Pluggable Data Formats Apache Log Format Columnar Storage Format
Hive Evolution
Originally: a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs
Now more and more: a parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
Nearly 100% of Hadoop jobs in the warehouse go through Hive
TRANSFORM scripts (any language): serialization + IPC overhead
Pre/Post Hooks (Java): statement validation/execution; example uses: auditing, replication, authorization, multiple clusters
Hive is an open system Different on-disk data formats Text File, Sequence File, Different in-memory data formats Java Integer/String, Hadoop IntWritable/Text User-provided map/reduce scripts In any language, use stdin/stdout to transfer data User-defined Functions Substr, Trim, From_unixtime User-defined Aggregation Functions Sum, Average
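A user-provided map/reduce script is just a program that reads tab-separated rows on stdin and writes tab-separated rows to stdout. A minimal Python sketch of such a script (the page_url-to-page_id mapping here is a hypothetical stand-in, using URL length in place of a real lookup):

```python
import sys

def transform(line):
    # input row: tab-separated (user_id, page_url, unix_time)
    user_id, page_url, unix_time = line.rstrip("\n").split("\t")
    # hypothetical stand-in for a real URL -> id lookup table
    page_id = str(len(page_url))
    # output row: tab-separated (user_id, page_id, unix_time)
    return "\t".join([user_id, page_id, unix_time])

if __name__ == "__main__" and not sys.stdin.isatty():
    # Hive streams rows to the script on stdin and reads its stdout
    for row in sys.stdin:
        print(transform(row))
```

Because the contract is only stdin/stdout, the same slot can be filled by Perl, PHP, or any other language, at the cost of the serialization overhead noted on the previous slide.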
File Format Example CREATE TABLE mylog ( user_id BIGINT, page_url STRING, unix_time INT) STORED AS TEXTFILE; LOAD DATA INPATH '/user/myname/log.txt' INTO TABLE mylog;
Existing File Formats
                               TEXTFILE     SEQUENCEFILE   RCFILE
Data type                      text only    text/binary    text/binary
Internal storage order         row-based    row-based      column-based
Compression                    file-based   block-based    block-based
Splitable*                     YES          YES            YES
Splitable* after compression   NO           YES            YES
* Splitable: capable of splitting the file so that a single huge file can be processed by multiple mappers in parallel.
SerDe Examples
CREATE TABLE mylog (
  user_id BIGINT, page_url STRING, unix_time INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
CREATE TABLE mylog_rc (
  user_id BIGINT, page_url STRING, unix_time INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;
SerDe SerDe is short for serialization/deserialization. It controls the format of a row. Serialized format: Delimited format (tab, comma, ctrl-a ) Thrift Protocols Deserialized (in-memory) format: Java Integer/String/ArrayList/HashMap Hadoop Writable classes User-defined Java Classes (Thrift)
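A toy round trip for the delimited format, using the ctrl-A delimiter mentioned above. Real Hive SerDes are Java classes driven by the table's column types; this sketch only illustrates the serialize/deserialize contract:

```python
CTRL_A = "\x01"  # Hive's default field delimiter

def serialize(row, delim=CTRL_A):
    # deserialized in-memory objects -> delimited on-disk line
    return delim.join(str(field) for field in row)

def deserialize(line, column_types, delim=CTRL_A):
    # delimited on-disk line -> typed in-memory objects
    return [t(field) for t, field in zip(column_types, line.split(delim))]

line = serialize([42, "http://example.com", 1249084800])
row = deserialize(line, [int, str, int])
assert row == [42, "http://example.com", 1249084800]
```

Swapping the SerDe (delimited, Thrift, columnar) changes only how rows map to bytes; the query layer above sees the same typed columns either way.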
Map/Reduce Scripts Examples
add file page_url_to_id.py;
add file my_python_session_cutter.py;
FROM (SELECT TRANSFORM(user_id, page_url, unix_time)
      USING 'page_url_to_id.py' AS (user_id, page_id, unix_time)
      FROM mylog
      DISTRIBUTE BY user_id SORT BY user_id, unix_time) mylog2
SELECT TRANSFORM(user_id, page_id, unix_time)
USING 'my_python_session_cutter.py' AS (user_id, session_info);
Comparison of UDF/UDAF vs. M/R scripts
                     UDF/UDAF              M/R scripts
Language             Java                  any language
Data format          in-memory objects     serialized streams
1/1 input/output     supported via UDF     supported
n/1 input/output     supported via UDAF    supported
1/n input/output     supported via UDTF    supported
Speed                faster                slower
Common Join Task
[diagram] Mappers scan Table X and Table Y in parallel; all mapper output is shuffled, and the join itself runs in the reducer.
Join in Map Reduce
[diagram] page_view(pageid, userid, time) joined with user(userid, age, gender). Map tags each row with its table id and emits it keyed on userid (e.g. 111 -> <1,1> for a page view, 111 -> <2,25> for the user row); shuffle/sort brings all values for a userid to one reducer; reduce combines the tagged values from the two tables to produce the joined rows.
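The reduce-side join above can be sketched in-process, using the slide's sample tables. The tagging and grouping below stand in for what map, shuffle, and reduce do across machines:

```python
from collections import defaultdict

def reduce_side_join(page_views, users):
    """Common (reduce-side) join sketch: map tags each row with its
    table of origin, the shuffle groups rows by the join key, and the
    reducer cross-products the two tagged groups per key."""
    shuffled = defaultdict(lambda: ([], []))
    for pageid, userid, time in page_views:    # "map" over table 1
        shuffled[userid][0].append((pageid, time))
    for userid, age, gender in users:          # "map" over table 2
        shuffled[userid][1].append((age, gender))
    joined = []
    for userid, (pv_rows, u_rows) in shuffled.items():  # "reduce"
        for pv in pv_rows:
            for u in u_rows:
                joined.append((userid,) + pv + u)
    return joined

page_views = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
users = [(111, 25, "female"), (222, 32, "male")]
result = reduce_side_join(page_views, users)
# three joined rows: two for userid 111, one for userid 222
```

The full shuffle of both tables is exactly the cost the map-join variants on the next slides try to avoid.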
Auto Map-Join
Auto Map-Join
Auto Map-Join
Bucketized Map-Join
Sort Merge Bucket Map-Join
Hive alone is not enough!
Workflow specification, scheduling, and execution framework: workflows are DAGs; nodes are data transfers and transformations; edges are dependencies between nodes
Reporting and dashboard tools
Hive query/workflow authoring tools
Warehouse management: track space and CPU usage of the cluster; capacity planning for growth
Warehouse Challenges
Warehouse Challenges Growth Data, data, and more data
Growth Numbers (March 2008 to March 2012)
Facebook Users (million): 14X growth
Queries/Day: 60X
Scribe Data GB/Day: 250X
Nodes: 260X
Size TB (Total): 2500X
HDFS Normal Deployment NameNode Data Node 1 Data Node 2 Data Node 3
First Attempts
Concatenate old tables/partitions: ALTER TABLE ... PARTITION (<p>) CONCATENATE - no need to compress/uncompress the data for RCFile
Hadoop Archive File - needed for bucketed files
Upgrade Namenode
HDFS Hacked Federation
[diagram] Two NameNodes (NN1, NN2); each DataNode (DN1, DN2, DN3) registers with both.
HDFS - Federated Deployment NameNode1 NameNode2 Data Node 1 Data Node 2 Data Node 3
HDFS Layout NEW Map Reduce HDFS Cluster with multiple Name Nodes
Corona Hive Query Task Tracker M Hive CLI + Job Client Job Tracker heartbeat Task Tracker Task Tracker R
Hadoop Corona Split the current Job Tracker Cluster Manager to manage resources/nodes One Corona Job Tracker per job Corona Job Tracker requests resources from Cluster Manager Small amount of state in Cluster Manager Can restart
Corona Hive Query Cluster Manager Task Tracker M Hive CLI + Job Client + Job Tracker heartbeat Task Tracker Task Tracker R
Warehouse Challenges Growth Isolation Space Isolation Compute Isolation Failure Isolation
Isolation - Now Hardware isolation Platinum cluster & Silver cluster Partial compute isolation Pools Pool1 Pool2 Pool3 Map Reduce Cluster HDFS Cluster
Challenges: Isolation Replication Platinum Silver
Isolation Pools FIFO within each pool
TEAM Minimum Slots ADS BI COEFFICIENT GROWTH SCRAPING INSIGHTS NETEGO PLATFORM
Isolation - Future Logical namespace per team Namespace encompasses Transport capacity (scribe) Realtime analytics capacity (puma) Storage capacity (hive tables) Compute capacity (periodic/adhoc analyses) Resource accountability per namespace Pools computed dynamically
Isolation - Future NEW NS1 NS2 NS3 Pool1 Pool2 Pool3 Map Reduce Cluster HDFS Cluster
Challenges: Testing Shadow testing with multiple DFS and MR clusters SILVER BRONZE DFS1 DFS5 DFSTEMP
Testing Snapshot cluster Queries for a day Track cpu/byte for top 100 queries
Warehouse Challenges Growth Isolation Multiple Regions Hadoop picky about new capacity requirements Need to use any capacity in any location Need to share data between regions
Multi Region Hive1 Map Reduce Replication HDFS
Project Prism Hive1 NS1 NS2 NS3 Replication Pool1 Pool2 Pool3 HDFS Cluster Central Namespace Server Replication Replication
Interactive Query - Peregrine
Peregrine Fast Approximate results Memory bound No Join/sub-query support
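Peregrine's internals aren't described here. As a generic illustration of the speed/accuracy trade-off behind "fast, approximate results", one common approach is to scan only a sample of the rows and scale the count up (a hypothetical sketch, not Peregrine's actual algorithm):

```python
import random

def approximate_count(rows, predicate, sample_rate=0.1, seed=0):
    """Generic sampling-based approximate count: evaluate the
    predicate on roughly sample_rate of the rows, then scale the
    hit count up by 1/sample_rate to estimate the full-scan answer."""
    rng = random.Random(seed)
    hits = sum(1 for r in rows
               if rng.random() < sample_rate and predicate(r))
    return int(hits / sample_rate)

# estimate how many of 10,000 integers are even (true answer: 5,000)
estimate = approximate_count(list(range(10000)), lambda x: x % 2 == 0)
```

The estimate lands near 5,000 but not exactly on it, which is the trade: a fraction of the I/O for an answer with sampling error, hence also the noted lack of join/sub-query support in such engines.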
Open Source
Hadoop: Facebook has its internal branch; releases to GitHub periodically
Hive: development is in Apache; pulls into internal branch periodically
Hive Open Projects Testing Benchmark Data Generator
Hive Open Projects Performance Materialized Views Cost-based optimizer for Hive Index Joins Better skew handling techniques Map-reduce-reduce-reduce* Hash without sort on map-reduce boundary
Questions?