Data Warehouse Overview. Namit Jain

Transcription

1 Data Warehouse Overview Namit Jain

2 Agenda Why data? Life of a tag for data infrastructure Warehouse architecture Challenges Summarizing

3 Data Science peace.facebook.com Friendships on Facebook

4 Data Science - facebook.com/data Gross National Happiness

5 Data Analyses

6 Data-enhanced Products People You May Know (PYMK) Newsfeed ranking Ads optimization Index building for search

7 External Reporting Social Plugin Insights

8 Internal Reporting Product Insights Data-driven product development Allows products to iterate quickly by observing user behavior

9 Life of a tag for data infrastructure

10 Facebook Architecture (Simplified)

11 Facebook Architecture Data Sources Log Data (facts) Web-tier user activity logs View/click of an ad, liking a story, fanning a page, status update, Backend Services - Search, Newsfeed, Ads Facebook-Site related Data (dimensions) MySQL Descriptions of ads User demographics

12 Life of a tag for data infrastructure Periodic Analysis Adhoc Analysis Daily report on count of photo tags by country (1day) nocron hipal Count photos tagged by females age yesterday Scrapes Warehouse User info reaches Warehouse (1day) MySQL copier/loader Log line reaches warehouse (1hr) User tags a photo Realtime Analytics Count users tagging photos in the last hour (1min) puma Scribe Log Storage Log line reaches Scribeh (1s) Log line generated: <user_id, photo_id>
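
The realtime question on this slide, "count users tagging photos in the last hour", can be sketched as a sliding-window distinct count. This is only an illustration: the slide's log line is <user_id, photo_id>, so the timestamp field, the function name, and the sample data are assumptions, and nothing here reflects how puma is implemented.

```python
def users_tagging_in_last_hour(log_lines, now, window=3600):
    """Count distinct users with a tag event in the half-open window (now - window, now].

    log_lines: iterable of (timestamp, user_id, photo_id) tuples.
    The timestamp field is an assumption added for this sketch.
    """
    return len({user for ts, user, _photo in log_lines if now - window < ts <= now})


# Made-up events: (timestamp in seconds, user_id, photo_id)
events = [(100, 1, 9), (200, 1, 8), (3000, 2, 7), (5000, 3, 6)]
active = users_tagging_in_last_hour(events, now=3700)
```

With `now=3700` the window covers timestamps 101 through 3700, so only the events at 200 and 3000 count.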

13 Takeaways Log collection Realtime analysis Batch analysis Periodic analysis Interactive analysis

14 Takeaways Scribe/Calligraphus Puma/HBase Hive/Hadoop Databee/Chronos

15 Takeaways Open Source Scribe HBase Hive/Hadoop

16 Scribe Open Source, simple and scalable log collection system Web Tier Mid-Tier Warehouse

17 Challenges: Choosing the right stack? Options: Hadoop/Hive, Oracle/AsterData, Sharded MySQL. Criteria: Cost, Availability, Scalability, Performance, ACID, Ease of Use

18 Warehouse Architecture

19 Warehouse Architecture Storage (HDFS)

20 Warehouse Architecture Compute (MapReduce) Storage (HDFS)

21 Warehouse Architecture Compute (MapReduce) Storage (HDFS) Hadoop

22 Warehouse Architecture Query (Hive) Compute (MapReduce) Storage (HDFS) Hadoop

23 Warehouse Architecture Workflow (Nocron) Query (Hive) Compute (MapReduce) Storage (HDFS) Hadoop

24 What is Hadoop: Open Source Apache project Framework for running applications on large clusters of commodity hardware Scale: petabytes of data on thousands of nodes Hadoop layers: Storage layer: HDFS Processing layer: MapReduce Characteristics: Uses clusters of commodity computers Supports moving computation close to data Single storage + compute cluster vs. separate clusters Scalable, fault tolerant, and easily managed But not as easy to program as databases (SQL)

25 HDFS Data Model Data is logically organized into files and directories Files are divided into uniform-sized blocks Blocks are distributed across the nodes of the cluster and are replicated to handle hardware failure HDFS keeps checksums of data for corruption detection and recovery HDFS exposes block placement so that computation can be migrated to data
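
The block, replication, and checksum ideas on this slide can be sketched in a few lines of Python. This is a toy illustration, not HDFS code: the tiny block size, the round-robin placement rule, the use of CRC32, and every function name are assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 8  # bytes; deliberately tiny, real HDFS blocks are far larger


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut a file's bytes into uniform-sized blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes, replication: int = 3):
    """Hypothetical round-robin placement of each block on `replication` distinct nodes.

    Assumes replication <= len(nodes).
    """
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(num_blocks)}


def checksum(block: bytes) -> int:
    """Checksum stored alongside the block when it is written."""
    return zlib.crc32(block)


def is_corrupt(block: bytes, expected: int) -> bool:
    """On read, recompute the checksum and compare it with the stored one."""
    return zlib.crc32(block) != expected


data = b"hello hdfs data model"
blocks = split_into_blocks(data)
sums = [checksum(b) for b in blocks]
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

If a read finds a corrupt block, another replica from `placement` can be used instead, which is the recovery path the slide refers to.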

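26 HDFS Architecture [Diagram: Clients send metadata ops to the Namenode, which holds the metadata (Name, #replicas, ): /users/foo/data, 3, . Clients read from and write to Datanodes, which are spread across Rack 1 and Rack 2. The Namenode sends block ops to the Datanodes, and blocks are replicated between Datanodes.]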

27 MapReduce Review - WordCount
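
As a reminder of the WordCount flow, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It only imitates the data flow in memory; the function names and sample lines are made up for the example and this is not Hadoop API code.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
```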

28 Warehouse Hadoop Storage/Compute

29 Hive Aim to simplify usage of Hadoop A system for managing and querying structured and semistructured data built on top of Hadoop Map-Reduce for execution HDFS for storage Metadata on HDFS files Key Building Principles SQL is a familiar language Extensibility Types, Functions, Formats, Scripts Performance

30 Hive Simplifying usage of Hadoop
hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.23-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
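
To make the comparison concrete, the same filter-and-count logic can be written as plain Python functions. This is a sketch under assumptions: rows are taken to be ctrl-A delimited with the key in the first column (as the awk mapper above suggests), and the sample rows are invented.

```python
from collections import Counter


def mapper(lines, sep="\x01"):
    """Emit the key column of each row whose key is greater than 100."""
    for line in lines:
        key = line.rstrip("\n").split(sep)[0]
        if int(key) > 100:
            yield key


def reducer(keys):
    """Count occurrences of each key (the `uniq -c` step of the streaming job)."""
    return Counter(keys)


# Invented sample rows in "key<ctrl-A>value" form
rows = ["238\x01val_238", "86\x01val_86", "311\x01val_311", "238\x01val_238"]
result = dict(reducer(mapper(rows)))
```

Either way, the user has to write the filtering and counting by hand, which is the work the one-line Hive query removes.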

31 Hive Architecture

32 Hive Data/Query Model Looks and behaves almost like a regular database Data Model Tables with typed columns Flexible types and storage formats Query Model Flavor of SQL for analytics queries Extensible via user defined functions and custom map/reduce scripts

33 Data Model
Hive Entity        Sample Metastore Entity   Sample HDFS Location
Table              T                         /wh/t
Partition          date=d1                   /wh/t/date=d1
Bucketing column   userid                    /wh/t/date=d1/part-0000, /wh/t/date=d1/part-1000 (hashed on userid)
External Table     extt                      /wh2/existing/dir (arbitrary location)

34 Data Model Tables Analogous to tables in relational DBs Each table has corresponding directory in HDFS Example Page views table name: pvs HDFS directory /wh/pvs

35 Data Model Partitions Analogous to dense indexes on partition columns Nested sub-directories in HDFS for each combination of partition column values Example Partition columns: ds, ctry HDFS subdirectory for ds = , ctry = US /wh/pvs/ds= /ctry=us HDFS subdirectory for ds = , ctry = CA /wh/pvs/ds= /ctry=ca
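
The mapping from partition column values to nested HDFS sub-directories can be sketched as follows. The date value is hypothetical (the slide's own dates are not shown), the helper names are made up, and the pruning function only illustrates why partitions behave like an index: directories that do not match the predicate are never read.

```python
def partition_path(table_dir, partition_spec):
    """Build the nested sub-directory for one combination of partition values.

    partition_spec: ordered list of (column, value) pairs.
    """
    return "/".join([table_dir] + [f"{col}={val}" for col, val in partition_spec])


def prune(partitions, **wanted):
    """Keep only the partitions whose column values match the requested ones."""
    return [p for p in partitions
            if all(dict(p).get(col) == val for col, val in wanted.items())]


# Hypothetical partitions of the pvs table
partitions = [
    [("ds", "2012-03-01"), ("ctry", "us")],
    [("ds", "2012-03-01"), ("ctry", "ca")],
    [("ds", "2012-03-02"), ("ctry", "us")],
]
us_paths = [partition_path("/wh/pvs", p) for p in prune(partitions, ctry="us")]
```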

36 Data Model Buckets Split data based on hash of a column - mainly for parallelism One HDFS file per bucket within partition sub-directory Example Bucket column: user into 32 buckets HDFS file for user hash 0 /wh/pvs/ds= /ctry=us/part HDFS file for user hash bucket 20 /wh/pvs/ds= /ctry=us/part-00020
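
Bucket assignment can be sketched as a hash modulo the bucket count. This assumes an integer column hashes to its own value and that a sign bit is masked off before the modulo; the file naming follows the `part-00020` pattern shown on the slide, and the partition directory used below is hypothetical.

```python
NUM_BUCKETS = 32


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Assumed rule: hash(int) is the int itself; the mask keeps the result non-negative."""
    return (user_id & 0x7FFFFFFF) % num_buckets


def bucket_file(partition_dir: str, bucket: int) -> str:
    """One file per bucket inside the partition sub-directory."""
    return f"{partition_dir}/part-{bucket:05d}"


# Hypothetical partition directory; the date is an assumption
path = bucket_file("/wh/pvs/ds=2012-03-01/ctry=us", bucket_for(52))
```

Because every row for a given user lands in the same file, 32 tasks can process one partition in parallel without sharing users.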

37 Data Model External Tables Point to existing data directories in HDFS Can create tables and partitions; partition columns just become annotations to external directories
Example: create external table with partitions
CREATE EXTERNAL TABLE pvs(userid int, pageid int) PARTITIONED BY (ds string, ctry string) STORED AS textfile LOCATION '/path/to/existing/table';
Example: add a partition to external table
ALTER TABLE pvs ADD PARTITION (ds= , ctry='US') LOCATION '/path/to/existing/partition';

38 Example Application Status updates table: status_updates(userid int, status string, ds string) Load the data from log files: LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates PARTITION (ds= ) User profile table: profiles(userid int, school string, gender int)

39 Example Query Plan (Filter) Filter status updates containing 'michael jackson' SELECT * FROM status_updates WHERE status LIKE '%michael jackson%'

40 Example Query Plan (Aggregation) Figure out total number of status_updates in a given day SELECT COUNT(1) FROM status_updates WHERE ds =
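
One way to picture the aggregation plan is a partial count in each mapper followed by a single sum in the reducer. The sketch below assumes that shape; the row layout, the `d1`/`d2` partition values, and the function names are invented, and the plan Hive actually generates may differ (for example, partition pruning would normally avoid scanning other days at all).

```python
def map_partial_count(split, ds):
    """Each mapper scans its input split and emits one partial count."""
    return sum(1 for row in split if row["ds"] == ds)


def reduce_total(partials):
    """A single reducer adds up the partial counts from all mappers."""
    return sum(partials)


# Two made-up input splits of the status_updates table
splits = [
    [{"userid": 1, "status": "hi", "ds": "d1"}, {"userid": 2, "status": "yo", "ds": "d2"}],
    [{"userid": 3, "status": "hey", "ds": "d1"}],
]
total = reduce_total(map_partial_count(s, "d1") for s in splits)
```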

41 Hive Query Language Extensibility Pluggable Map-reduce scripts Pluggable User Defined Functions Pluggable User Defined Types Complex object types: List of Maps Pluggable Data Formats Apache Log Format Columnar Storage Format

42 Hive Evolution Originally: a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs Now more and more: a parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture Nearly 100% of Hadoop jobs in the warehouse go through Hive. TRANSFORM scripts (any language) Serialization+IPC overhead Pre/Post Hooks (Java) Statement validation/execution Example uses: auditing, replication, authorization, multiple clusters

43 Hive is an open system Different on-disk data formats Text File, Sequence File, Different in-memory data formats Java Integer/String, Hadoop IntWritable/Text User-provided map/reduce scripts In any language, use stdin/stdout to transfer data User-defined Functions Substr, Trim, From_unixtime User-defined Aggregation Functions Sum, Average

44 File Format Example CREATE TABLE mylog ( user_id BIGINT, page_url STRING, unix_time INT) STORED AS TEXTFILE; LOAD DATA INPATH '/user/myname/log.txt' INTO TABLE mylog;

45 Existing File Formats
                               TEXTFILE     SEQUENCEFILE   RCFILE
Data type                      text only    text/binary    text/binary
Internal storage order         Row-based    Row-based      Column-based
Compression                    File-based   Block-based    Block-based
Splittable*                    YES          YES            YES
Splittable* after compression  NO           YES            YES
* Splittable: Capable of splitting the file so that a single huge file can be processed by multiple mappers in parallel.

46 SerDe Examples
CREATE TABLE mylog ( user_id BIGINT, page_url STRING, unix_time INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
CREATE TABLE mylog_rc ( user_id BIGINT, page_url STRING, unix_time INT) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFILE;

47 SerDe SerDe is short for serialization/deserialization. It controls the format of a row. Serialized format: Delimited format (tab, comma, ctrl-a ) Thrift Protocols Deserialized (in-memory) format: Java Integer/String/ArrayList/HashMap Hadoop Writable classes User-defined Java Classes (Thrift)
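
A SerDe's job can be illustrated with a tiny Python round trip between a ctrl-A delimited line and a typed in-memory row. This is a sketch of the idea only, not the Hive SerDe interface; the schema mirrors the `mylog` columns from the earlier slides, and the sample values are made up.

```python
FIELD_SEP = "\x01"  # ctrl-A, one of the delimiters listed on the slide

# (column name, Python type) pairs mirroring the mylog table
SCHEMA = [("user_id", int), ("page_url", str), ("unix_time", int)]


def deserialize(line, schema=SCHEMA, sep=FIELD_SEP):
    """Serialized bytes/text -> typed in-memory row."""
    fields = line.rstrip("\n").split(sep)
    return {name: cast(raw) for (name, cast), raw in zip(schema, fields)}


def serialize(row, schema=SCHEMA, sep=FIELD_SEP):
    """Typed in-memory row -> delimited text."""
    return sep.join(str(row[name]) for name, _ in schema)


line = "42\x01http://example.com/a\x011330560000"
row = deserialize(line)
```

Swapping the delimiter or the schema changes the on-disk format without touching the code that consumes `row`, which is the separation a SerDe provides.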

48 Map/Reduce Scripts Examples
add file page_url_to_id.py;
add file my_python_session_cutter.py;
FROM (SELECT TRANSFORM(user_id, page_url, unix_time) USING 'page_url_to_id.py' AS (user_id, page_id, unix_time) FROM mylog DISTRIBUTE BY user_id SORT BY user_id, unix_time) mylog2
SELECT TRANSFORM(user_id, page_id, unix_time) USING 'my_python_session_cutter.py' AS (user_id, session_info);
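
The slide does not show the contents of `my_python_session_cutter.py`, so the following is a purely hypothetical sketch of what such a script's core logic might look like. It assumes rows arrive grouped by user and ordered by time (which is what DISTRIBUTE BY and SORT BY arrange), and that a gap of more than 30 minutes starts a new session; both the threshold and the output shape are assumptions.

```python
from itertools import groupby

SESSION_GAP = 1800  # seconds; hypothetical threshold for starting a new session


def cut_sessions(rows, gap=SESSION_GAP):
    """Yield (user_id, [page_id, ...]) for each session.

    rows: (user_id, page_id, unix_time) tuples sorted by user_id, then unix_time.
    """
    for user_id, group in groupby(rows, key=lambda r: r[0]):
        session = []
        last_time = None
        for _, page_id, t in group:
            if last_time is not None and t - last_time > gap:
                yield user_id, session  # gap too large: close the current session
                session = []
            session.append(page_id)
            last_time = t
        if session:
            yield user_id, session


rows = [(1, 10, 0), (1, 11, 60), (1, 12, 4000), (2, 10, 100)]
sessions = list(cut_sessions(rows))
```

A real TRANSFORM script would wrap this in a loop that reads tab-separated lines from stdin and prints tab-separated lines to stdout.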

49 Comparison of UDF/UDAF vs. M/R scripts
                   UDF/UDAF             M/R scripts
Language           Java                 any language
Data format        in-memory objects    serialized streams
1/1 input/output   supported via UDF    supported
n/1 input/output   supported via UDAF   supported
1/n input/output   supported via UDTF   supported
Speed              faster               slower
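
The three input/output shapes in this comparison can be shown with three small functions. Hive's UDFs, UDAFs, and UDTFs are written in Java, so this Python sketch only illustrates the row cardinalities; the function names are made up.

```python
def udf_upper(value):
    """1/1: one input row produces one output value (UDF shape)."""
    return value.upper()


def udaf_average(values):
    """n/1: many input rows produce one output value (UDAF shape)."""
    values = list(values)
    return sum(values) / len(values)


def udtf_explode(items):
    """1/n: one input row produces many output rows (UDTF shape)."""
    for item in items:
        yield item


upper = udf_upper("abc")
average = udaf_average([1, 2, 3])
exploded = list(udtf_explode(["a", "b"]))
```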

50 Common Join [Diagram: Task A, the common join task, runs Mappers over Table X and Table Y; mapper output goes through the Shuffle to the Reducer.]

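51 Join in Map Reduce [Diagram: Map reads page_view (pageid, userid, time) and user (userid, age, gender) and emits key/value pairs keyed by userid. page_view rows become 111 <1,1>, 111 <1,2>, 222 <1,1>; user rows become 111 <2,25>, 222 <2,32>. Shuffle/Sort groups the pairs by key, so key 111 receives <1,1>, <1,2>, <2,25> and key 222 receives <1,1>, <2,32>. Reduce joins the values within each key.]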
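
The reduce-side join in this diagram can be imitated in memory. The sketch reads the diagram's values as userid-keyed pairs tagged 1 for page_view and 2 for user, which is an interpretation of the flattened slide; the time and gender columns are omitted, and the function names are made up.

```python
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222)]  # (pageid, userid); time omitted
user = [(111, 25), (222, 32)]                # (userid, age); gender omitted


def map_phase():
    """Emit (join key, (table tag, value)) for rows of both tables."""
    for pageid, userid in page_view:
        yield userid, (1, pageid)  # tag 1 marks page_view rows
    for userid, age in user:
        yield userid, (2, age)     # tag 2 marks user rows


def shuffle_sort(pairs):
    """Group values by key and sort them, so tag 1 values precede tag 2 values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {k: sorted(v) for k, v in sorted(groups.items())}


def reduce_phase(groups):
    """For each key, pair every page_view value with every user value."""
    for _userid, values in groups.items():
        pages = [v for tag, v in values if tag == 1]
        ages = [v for tag, v in values if tag == 2]
        for pageid in pages:
            for age in ages:
                yield pageid, age


groups = shuffle_sort(map_phase())
joined = list(reduce_phase(groups))
```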

52 Auto Map-Join

53 Auto Map-Join

54 Auto Map-Join
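
As general background for the map-join slides: a map-side join loads the small table into an in-memory hash table in every mapper, so the large table can be joined while it is scanned, with no shuffle or reducer. The sketch below illustrates only that idea; the table contents and function name are made up, it assumes the small table has unique keys, and it says nothing about how Hive decides when to convert a join automatically.

```python
def map_join(big_rows, small_rows):
    """Join (pageid, userid) rows against a small (userid, age) table held in memory."""
    lookup = {userid: age for userid, age in small_rows}  # built once per mapper
    for pageid, userid in big_rows:
        if userid in lookup:  # inner join: drop rows without a match
            yield pageid, lookup[userid]


big = [(1, 111), (2, 111), (1, 222), (3, 333)]
small = [(111, 25), (222, 32)]
map_joined = list(map_join(big, small))
```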

55 Bucketized Map-Join

56 Sort Merge Bucket Map-Join
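
When both tables are bucketed and sorted on the join key, matching buckets can be joined with a single merge pass instead of a hash table. The following is a minimal merge-join sketch for one pair of buckets; it assumes the right side has unique keys, and the data and function name are invented for illustration.

```python
def merge_join(left, right):
    """Merge-join two lists of (key, value) pairs that are already sorted by key.

    left may repeat keys; right is assumed to have unique keys.
    """
    out = []
    j = 0
    for key, lval in left:
        # Advance the right cursor past keys smaller than the current left key
        while j < len(right) and right[j][0] < key:
            j += 1
        if j < len(right) and right[j][0] == key:
            out.append((key, lval, right[j][1]))
    return out


left_bucket = [(111, 1), (111, 2), (222, 1), (333, 9)]
right_bucket = [(111, 25), (222, 32), (444, 50)]
merged = merge_join(left_bucket, right_bucket)
```

Each bucket is read once and in order, so memory use stays constant regardless of bucket size.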

57 Hive is not Enough! Workflow specification, schedule and execution framework Workflows are DAGs Nodes are data transfers and transformations Edges are dependencies between nodes Reporting and Dashboard Tools Hive Query/Workflow Authoring Tools Warehouse management Track space and CPU usage of the cluster Capacity planning for growth
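
Because a workflow is a DAG whose edges are dependencies, a valid execution order is a topological sort of its nodes. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the workflow and its node names are hypothetical and do not describe any actual scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each node maps to the set of nodes it depends on
workflow = {
    "load_logs": set(),
    "load_user_info": set(),
    "join_tags_with_users": {"load_logs", "load_user_info"},
    "daily_report": {"join_tags_with_users"},
}


def execution_order(dag):
    """Return the nodes in an order that runs every dependency before its dependents."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies contain a cycle, which is a useful validation step for a workflow specification.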

58 Warehouse Challenges

59 Warehouse Challenges Growth Data, data, and more data

60 Growth Numbers
             Facebook Users (million)   Queries/Day   Scribe Data GB/Day   Nodes   Size TB (Total)
March 2008
March 2012
Growth       14X                        60X           250X                 260X    2500X

61 HDFS Normal Deployment NameNode Data Node 1 Data Node 2 Data Node 3

62 First Attempts Concatenate old tables/partitions Alter table partition <p> concatenate No need to compress/uncompress the data for RCFile Hadoop Archive File Needed for bucketed files Upgrade Namenode

63 HDFS Hacked Federation NN1 NN2 DN1 DN1 DN2 DN2 DN3 DN3

64 HDFS - Federated Deployment NameNode1 NameNode2 Data Node 1 Data Node 2 Data Node 3

65 HDFS Layout NEW Map Reduce HDFS Cluster with multiple Name Nodes

66 Corona Hive Query Task Tracker M Hive CLI + Job Client Job Tracker heartbeat Task Tracker Task Tracker R

67 Hadoop Corona Split the current Job Tracker Cluster Manager to manage resources/nodes One Corona Job Tracker per job Corona Job Tracker requests resources from Cluster Manager Small amount of state in Cluster Manager Can restart

68 Corona Hive Query Cluster Manager Task Tracker M Hive CLI + Job Client + Job Tracker heartbeat Task Tracker Task Tracker R

69 Warehouse Challenges Growth Isolation Space Isolation Compute Isolation Failure Isolation

70 Isolation - Now Hardware isolation Platinum cluster & Silver cluster Partial compute isolation Pools Pool1 Pool2 Pool3 Map Reduce Cluster HDFS Cluster

71 Challenges: Isolation Replication Platinum Silver

72 Isolation Pools FIFO within each pool
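
FIFO scheduling inside a pool can be sketched as a queue with a fixed slot budget. Everything here is an assumption made for illustration: the class, the slot numbers, and the strict rule that a job that does not fit blocks the jobs behind it. The slide only states that ordering within a pool is FIFO.

```python
from collections import deque


class Pool:
    def __init__(self, name, slots):
        self.name = name
        self.slots = slots    # slots reserved for this pool (hypothetical number)
        self.queue = deque()  # jobs wait in arrival order

    def submit(self, job, needed):
        """Queue a job together with the number of slots it needs."""
        self.queue.append((job, needed))

    def schedule(self):
        """Start jobs strictly in FIFO order until the head of the queue no longer fits."""
        started = []
        free = self.slots
        while self.queue and self.queue[0][1] <= free:
            job, needed = self.queue.popleft()
            free -= needed
            started.append(job)
        return started


pool = Pool("ads", slots=10)
pool.submit("a", 6)
pool.submit("b", 6)
pool.submit("c", 2)
started = pool.schedule()
```

Job `c` would fit in the remaining slots but waits behind `b`, which shows the ordering guarantee and its cost; jobs in other pools are unaffected, which is the isolation part.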

73 TEAM Minimum Slots ADS BI COEFFICIENT GROWTH SCRAPING INSIGHTS NETEGO PLATFORM

74 Isolation - Future Logical namespace per team Namespace encompasses Transport capacity (scribe) Realtime analytics capacity (puma) Storage capacity (hive tables) Compute capacity (periodic/adhoc analyses) Resource accountability per namespace Pools computed dynamically

75 Isolation - Future NEW NS1 NS2 NS3 Pool1 Pool2 Pool3 Map Reduce Cluster HDFS Cluster

76 Challenges: Testing Shadow testing with multiple DFS and MR clusters SILVER BRONZE DFS1 DFS5 DFSTEMP

77 Testing Snapshot cluster Queries for a day Track cpu/byte for top 100 queries

78 Warehouse Challenges Growth Isolation Multiple Regions Hadoop picky about new capacity requirements Need to use any capacity in any location Need to share data between regions

79 Multi Region Hive1 Map Reduce Replication HDFS

80 Project Prism Hive1 NS1 NS2 NS3 Replication Pool1 Pool2 Pool3 HDFS Cluster Central Namespace Server Replication Replication

81 Interactive Query - Peregrine

82 Peregrine Fast Approximate results Memory bound No Join/sub-query support

83 Open Source Hadoop Facebook has its internal branch Releases to GitHub periodically Hive Development is in Apache Pulls into internal branch periodically

84 Hive Open Projects Testing Benchmark Data Generator

85 Hive Open Projects Performance Materialized Views Cost-based optimizer for Hive Index Joins Better skew handling techniques Map-reduce-reduce-reduce* Hash without sort on map-reduce boundary

86 Questions?