Data Warehouse 2.0: How Hive & the Emerging Interactive Query Engines Change the Game Forever
David P. Mariani, AtScale, Inc.
September 16, 2013
THE TRUTH ABOUT DATA
We think only 3% of the potentially useful data is tagged, and even less is analyzed. (Source: IDC Predictions 2013: Big Data, IDC)
90% of the data in the world today has been created in the last two years. (Source: IBM)
The Broken Promise
What we wanted: a centralized data warehouse serving Sales, Finance, Marketing, and CRM.
What we got: departmental data marts for Sales, Finance, Marketing, and CRM.
The centralized data warehouse couldn't handle the volume, velocity & variety.
A new way to manage data

  Requirement       Traditional Databases   Hadoop            Addresses
  Capture/Store     Write Many              Write Once        Volume
  Model/Map         Structured              Semi-Structured   Variety
  Transform/Load    Early                   Late              Velocity
It's time for a new approach
1990's, Relational DBs: Capture -> File Server -> ETL Tool (Extract, Transform, Load) -> Query Engine -> Query
2000's, MPP DBs: Capture -> File Server -> ETL Tool (Extract, Load) -> Query Engine -> Transform -> Query
Now, Hadoop + Hive: Capture -> Hadoop + Hive (Map, Transform) -> Query
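The "transform late" idea above, namely store raw captures first and impose structure only when a query needs it, can be sketched in a few lines of Python. This is a hypothetical illustration of schema-on-read, not code from any of the systems named here:

```python
import json

# Schema-on-write (traditional): rows must match the schema at load time,
# so fields nobody anticipated are dropped forever.
# Schema-on-read (Hadoop/Hive): capture raw events verbatim, then project
# whatever columns a query needs at query time.

raw_events = [  # captured as-is, no upfront modeling
    '{"event": "login", "ts": 1369155542, "user_id": 4533245, "loc": 23}',
    '{"event": "buy", "ts": 1369155556, "user_id": 4533446, "amt": "1.50"}',
]

def query(raw, columns):
    """Apply a schema late: parse each raw record and project columns."""
    for line in raw:
        rec = json.loads(line)
        yield tuple(rec.get(c) for c in columns)  # missing field -> None, not an error

# A new question ("which locations log in?") needs no reload or re-ETL:
rows = list(query(raw_events, ["event", "loc"]))  # [("login", 23), ("buy", None)]
```

Because the raw capture is never thrown away, a field that was ignored at load time (like "amt") is still there when a later query wants it.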
Example 1: Klout
Example 1: Klout's Big Data
- 15 social networks processed every day
- 769 terabytes of data storage
- 200,000 indexed users added every day
- 400,000,000 users indexed every day
- 12,000,000,000 social signals processed every day
- 50,000,000,000 API calls delivered every month
- 1,080,000,000,000 rows of data in the data warehouse
Example 1: Klout data architecture
Data Pipeline & Factory: Signal Collectors (Java/Scala) -> Data Enhancement -> Data Engine (Pig/Hive) -> Warehouse (Hive)
Serving Stores: Registrations DB (MySQL), Profile DB (HBase), Search Index (ElasticSearch), Streams (MongoDB)
Serving / UX: Klout.com (Node.js), Klout API (Scala), Mobile (Objective-C), Partner API (Mashery)
Analytics: Dashboards (Tableau), Analytics Cubes (SSAS), Perks Analytics (Scala), Event Tracker (Scala)
Monitoring: Nagios
Example 1: Klout Event Tracker
Pipeline: Instrument (UX) -> Collect (Tracker API, Node.js; Log Process, Flume) -> Persist (Warehouse) -> Query (Cube, Analysis Services) -> Report (Klout UI, AJAX)
Example 1: Klout Event Tracker
Instrument: the UI fires a tracking call to the Tracker API, e.g.:
insights3:9003/track/{"project":"plusk","event":"spend","ks_uid":123456,"type":"add_topic"}
Example 1: Klout Event Tracker
Collect: the tracked event is logged as JSON:
{
  "project":"plusk",
  "event":"spend",
  "session_id":"0",
  "ip":"50.68.47.158",
  "kloutid":"123456",
  "cookie_id":"123456",
  "ref":"http://klout.com/",
  "type":"add_topic",
  "time":"1338366015"
}
and will be saved in HDFS at: /logs/events_tracking/2012-05-30/0100
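The date/hour bucketing in that HDFS path can be sketched in Python. This is a hypothetical illustration, not Klout's code; the path layout comes from the slide, and the UTC-7 (Pacific daylight) offset is an assumption chosen because it makes the sample event's timestamp land in the slide's 0100 bucket:

```python
from datetime import datetime, timezone, timedelta

def hdfs_event_path(epoch_seconds: int) -> str:
    """Bucket an event into /logs/events_tracking/<date>/<hour> as on the slide.
    Assumes Pacific daylight time (UTC-7) -- an assumption, not confirmed
    by the deck."""
    pdt = timezone(timedelta(hours=-7))
    t = datetime.fromtimestamp(epoch_seconds, tz=pdt)
    return t.strftime("/logs/events_tracking/%Y-%m-%d/%H00")

# The sample event's "time":"1338366015" maps to the slide's path:
path = hdfs_event_path(1338366015)  # "/logs/events_tracking/2012-05-30/0100"
```

Partitioning the log directory by date and hour is what lets Hive prune whole directories when a query filters on the dt and hr partition columns.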
Example 1: Klout Event Tracker
Persist: events land in the Hive warehouse table EVENT_LOG:
  tstamp      INT
  project     STRING
  event       STRING
  session_id  BIGINT
  ks_uid      BIGINT
  ip          STRING
  json_map    MAP<STRING,STRING>
  json_text   STRING
  dt          STRING
  hr          STRING
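Note that EVENT_LOG keeps both promoted columns (tstamp, project, ...) and the raw payload (json_text) plus a parsed map (json_map), so fields that were never promoted to columns stay queryable later. A hypothetical Python sketch of that mapping, mirroring the schema on the slide:

```python
import json

def to_event_log_row(raw_json: str, dt: str, hr: str) -> dict:
    """Flatten one raw JSON event into the EVENT_LOG columns.
    Hypothetical illustration; field names follow the sample event."""
    rec = json.loads(raw_json)
    return {
        "tstamp": int(rec["time"]),
        "project": rec["project"],
        "event": rec["event"],
        "session_id": int(rec["session_id"]),
        "ks_uid": int(rec["kloutid"]),
        "ip": rec["ip"],
        "json_map": {k: str(v) for k, v in rec.items()},  # every field, as strings
        "json_text": raw_json,                            # raw payload, untouched
        "dt": dt,                                         # partition columns
        "hr": hr,
    }

row = to_event_log_row(
    '{"project":"plusk","event":"spend","session_id":"0",'
    '"ip":"50.68.47.158","kloutid":"123456","time":"1338366015"}',
    dt="2012-05-30", hr="01",
)
```

Keeping json_text verbatim is the schema-on-read insurance policy: a query can always reach into the raw JSON for a field the table designers did not promote.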
Example 1: Klout Event Tracker
Report: the Analysis Services cube answers reporting queries in MDX, e.g.:

SELECT
  {[Measures].[Counter], [Measures].[PreviousPeriodCounter]} ON COLUMNS,
  NON EMPTY CROSSJOIN (
    EXISTS([Date].[Date].[Date].ALLMEMBERS,
      [Date].[Date].&[2012-05-19T00:00:00]:[Date].[Date].&[2012-06-02T00:00:00]),
    [Events].[Event].[Event].ALLMEMBERS
  ) DIMENSION PROPERTIES MEMBER_CAPTION ON ROWS
FROM [ProductInsight]
WHERE ({[Projects].[Project].[plusK]})
Example 1: Klout Event Tracker
Query: ad hoc analysis runs directly against the Hive warehouse:

SELECT get_json_object(json_text,'$.sid') AS sid,
       get_json_object(json_text,'$.kloutid') AS kloutid,
       get_json_object(json_text,'$.v') AS version,
       get_json_object(json_text,'$.status') AS status,
       event
FROM bi.event_log
WHERE project='mobile ios'
  AND tstamp=20121027
  AND event IN ('api_error', 'api_timeout')
ORDER BY sid;
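Hive's get_json_object extracts a field from raw JSON text using a JSONPath-style expression, returning NULL when the field is absent. Its behavior for the simple '$.field' paths used above can be approximated in Python (a simplified sketch, not Hive's implementation):

```python
import json

def get_json_object(json_text: str, path: str):
    """Approximate Hive's get_json_object for simple '$.field' paths only.
    Returns None (as Hive returns NULL) for missing fields or unparsable JSON."""
    assert path.startswith("$."), "only top-level '$.field' paths supported here"
    try:
        value = json.loads(json_text).get(path[2:])
    except ValueError:
        return None
    return None if value is None else str(value)

log_line = '{"sid":"abc123","kloutid":"123456","status":"timeout"}'
get_json_object(log_line, "$.sid")  # "abc123"
get_json_object(log_line, "$.v")    # None -- field absent, like SQL NULL
```

This is why the EVENT_LOG table can stay narrow: any field in json_text is one get_json_object call away, at the cost of parsing JSON per row at query time.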
Example 2: Online Gaming Company

Capture (raw tab-delimited log lines):
LogIn\t1369155542\t4533245\t"loc":"23","rank":"Expert","client":"ios"\lf
Buy\t1369155556\t4533446\t"loc":"23","item":"212","ref":"ask.com","amt":"1.50"\lf

Map:
CREATE EXTERNAL TABLE event_log (
  event STRING,
  event_time TIMESTAMP,
  user_id INT,
  event_attributes MAP<STRING, STRING>
)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
LOCATION '/user/event_logs';

Transform + Query:
SELECT SUBSTR(FROM_UNIXTIME(event_time),1,7) AS MonthOfEvent,
       event_attributes["loc"] AS Location,
       COUNT(*) AS EventCount
FROM event_log
WHERE YEAR(FROM_UNIXTIME(event_time)) = 2013
GROUP BY SUBSTR(FROM_UNIXTIME(event_time),1,7),
         event_attributes["loc"]
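The capture format above, tab-separated fields with a comma-delimited key:value attribute map, is exactly what the table's delimiters describe, and a few lines of Python can show mechanically what Hive's delimited SerDe does with each line (a hypothetical sketch, not Hive code):

```python
def parse_event(line: str) -> dict:
    """Split one tab-delimited log line into the event_log columns.
    Mirrors the table's delimiters: tab between fields, ',' between
    map entries, ':' between map keys and values."""
    event, event_time, user_id, attrs = line.rstrip("\n").split("\t")
    attributes = {}
    for pair in attrs.split(","):
        key, value = pair.split(":", 1)
        attributes[key.strip('"')] = value.strip('"')
    return {
        "event": event,
        "event_time": int(event_time),
        "user_id": int(user_id),
        "event_attributes": attributes,
    }

row = parse_event('Buy\t1369155556\t4533446\t"loc":"23","item":"212","amt":"1.50"')
# row["event_attributes"] == {"loc": "23", "item": "212", "amt": "1.50"}
```

Because the attribute map is open-ended, LogIn and Buy events can carry entirely different keys in the same table, which is what the slide means by capturing variety without early modeling.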
Hive began as a batch tool
Batch: Signal Collectors (Java/Scala) -> Data Enhancement -> Data Engine (Pig/Hive) -> Warehouse (Hive)
Interactive: Serving Stores (Registrations DB (MySQL), Profile DB (HBase), Search Index (ElasticSearch), Streams (MongoDB)); Klout.com (Node.js), Klout API (Scala), Mobile (Objective-C), Partner API (Mashery); Dashboards (Tableau), Analytics Cubes (SSAS), Perks Analytics (Scala), Event Tracker (Scala); Monitoring (Nagios)
Hive now has interactive flavors

Table: Hive-compatible interactive query engines

                                   Shark      Impala       Stinger
  Performance approach             Use RAM    Replace MR   Improve Hive
  Theoretical limits (# of rows)   Billions   Trillions    Trillions
  Supports UDFs, SerDes            Yes        Soon         Yes
  Supports non-scalar data types   Yes        Soon         Yes
  Preferred file format            Tachyon    Parquet      ORC
  Sponsorship                      AMPLab     Cloudera     Hortonworks
Hive is an inexpensive MPP database

TPC-H query run times, Impala vs. HANA (lineitem table, 60 million rows)
Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoopimpala-on-aws

Times in seconds, per select statement (records returned / HANA Small / Impala Small 1-node Parquet / Impala Small 3-node Parquet / Impala Small 1-node Text / Impala Small 3-node Text):

- select count(*) from lineitem: 1 / 1 / 3 / 1 / 74 / 31
- select count(*), sum(l_extendedprice) from lineitem: 1 / 4 / 12 / 3 / 73 / 29
- select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode: 7 / 8 / 23 / 5 / 74 / 28
- select l_shipmode, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' group by l_shipmode: 1 / 1 / 20 / 4 / 73 / 28
- select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus: 14 / 10 / 32 / 7 / 74 / 28
- select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' group by l_shipmode, l_linestatus: 1 / 1 / 27 / 5 / 72 / 29
- select count(*) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1: 45 / 1 / 23 / 5 / 73 / 30
- select l_shipmode, l_linestatus, l_extendedprice from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1: 45 / 1 / 29 / 5 / 73 / 31
- select * from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1: 45 / 1 / 104 / 21 / 73 / 30

Data sizes: HANA 1.9 GB (5 partitions); Parquet 3.2 GB (40 files x 80 MB); Text 7.2 GB (1 file, no compression)
Est. monthly cost of production environment on AWS (HANA m2.xlarge, Impala m1.medium): HANA $1022; Impala 1 node $175; Impala 3 nodes $350
Demonstration Hive vs. Impala
Summary
- Hadoop will disrupt the data warehousing ecosystem.
- Consider Hadoop/Hive for new applications.
- Rethink how you capture & store data: capture as much as possible, but don't aggregate or normalize it.
- Dimensional modeling is still relevant (but much less constricting).
- Impose a schema as late as possible (at query time if possible).
Contact Information
If you have further questions or comments:
David P. Mariani
AtScale, Inc.
dave@atscale.com
@dmariani