Real-Time Data Analytics and Visualization
Making the Leap to BI on Hadoop
Predictive Analytics & Business Insights 2015, February 9, 2015
David P. Mariani, CEO, AtScale, Inc.
THE TRUTH ABOUT DATA
"We think only 3% of the potentially useful data is tagged, and even less is analyzed." Source: IDC Predictions 2013: Big Data, IDC
"90% of the data in the world today has been created in the last two years." Source: IBM
What We Wanted The Centralized Broken Data Warehouse Promise
What We Got Data Marts
What We Wanted Centralized Data Warehouse
What is Hadoop?
- Distributed File System (HDFS)
- Designed for commodity hardware
- Supports any file format (SerDes)
- Linearly scalable, parallel
What is Hive?
- SQL-like interface on top of Hadoop
- Has become the semantic layer for Hadoop
- Originally designed for batch processing
- Now has interactive flavors
Hive Now Comes in Several Flavors

| Feature | Spark SQL | Impala | Hive/Tez | Drill |
|---|---|---|---|---|
| Performance approach | Caching | Optimizer | Improve Hive | Optimizer |
| Theoretical limits (# of rows) | Billions | Trillions | Trillions | Trillions |
| Supports UDFs, SerDes | Yes | Soon | Yes | Yes |
| Supports non-scalar data types | Yes | Soon | Yes | Yes |
| Preferred file format | Tachyon | Parquet | ORC | Parquet |
| Sponsorship | Databricks | Cloudera | Hortonworks | MapR |
Hive is a Cheap MPP Database
TPC-H Query Run Times (Impala vs. HANA), lineitem table, 60 million rows. Times in seconds.

Queries:
Q1: select count(*) from lineitem
Q2: select count(*), sum(l_extendedprice) from lineitem
Q3: select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode
Q4: select l_shipmode, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' group by l_shipmode
Q5: select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus
Q6: select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' group by l_shipmode, l_linestatus
Q7: select count(*) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1
Q8: select l_shipmode, l_linestatus, l_extendedprice from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1
Q9: select * from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1

| Query | Records returned | HANA Small | Impala Small (1 node), Parquet | Impala Small (3 nodes), Parquet | Impala Small (1 node), Text | Impala Small (3 nodes), Text |
|---|---|---|---|---|---|---|
| Q1 | 1 | 1 | 3 | 1 | 74 | 31 |
| Q2 | 1 | 4 | 12 | 3 | 73 | 29 |
| Q3 | 7 | 8 | 23 | 5 | 74 | 28 |
| Q4 | 1 | 1 | 20 | 4 | 73 | 28 |
| Q5 | 14 | 10 | 32 | 7 | 74 | 28 |
| Q6 | 1 | 1 | 27 | 5 | 72 | 29 |
| Q7 | 45 | 1 | 23 | 5 | 73 | 30 |
| Q8 | 45 | 1 | 29 | 5 | 73 | 31 |
| Q9 | 45 | 1 | 104 | 21 | 73 | 30 |
| Est. monthly cost of production environment on AWS (HANA m2.xlarge, Impala m1.medium) | | $1022 | $175 | $350 | $175 | $350 |

Size: HANA 1.9 GB (5 Part.); Impala Parquet 3.2 GB (40 files x 80 MB); Impala Text 7.2 GB (1 file, no compression)
Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws
WHAT WE GOT ETL + STAR SCHEMAS
Traditional Data Architecture
ANALYSIS TOOLS / QUERY ENGINE / MART, MART, MART / ETL / DATA WAREHOUSE / INPUT DATA
What's Wrong with this Picture?
ANALYSIS TOOLS / QUERY ENGINE / MART, MART, MART / ETL / DATA WAREHOUSE / INPUT DATA
- Highly complex
- Lots of people & skillsets
- Multiple copies of data
- Stale data
- Rigid schema
- Tough to change
Write Many. Structured Data. Schema on Load.
It Takes an Army
1. SAN/NAS Engineer: Define Storage Architecture
2. Data Warehouse Architect: Design Star Schema
3. DBA: Create Tables
4. ETL Engineer: Write ETL Code
5. DBA: Automate Data Load
6. BI Engineer: Design Cube
7. ETL Engineer: Automate Cube Load
8. BI Engineer: Design Reports/Dashboards
Star Schema = Unnatural!
WHAT WE WANTED SCHEMA ON DEMAND
The New Way: Eliminate Layers
Traditional Approach: ANALYSIS TOOLS / QUERY ENGINE / MART, MART, MART / ETL / DATA WAREHOUSE / INPUT DATA
New Approach: ANALYSIS TOOLS / HADOOP / INPUT DATA
Map & Transform on Read VS Write Once Nested, Loosely Structured Schema on Read
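A minimal sketch of what "schema on read" can look like, written in plain Python rather than Hive so it runs anywhere. The records, field names, and the `SCHEMA` mapping are hypothetical and are not taken from the talk. The raw lines are stored once, as they arrived, and the column mapping is applied only when the data is read.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (write once).
RAW_LINES = [
    '{"user": "a", "event": "login", "props": {"device": "ios"}}',
    '{"user": "b", "event": "purchase", "props": {"amount": 9.99, "device": "web"}}',
    '{"user": "c", "event": "login"}',
]

# The "schema" is only a mapping from output column to a path in the raw record.
SCHEMA = {
    "user": ("user",),
    "event": ("event",),
    "device": ("props", "device"),
}

def extract(record, path):
    """Walk a nested dict along path; return None when a key is absent."""
    value = record
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def read_with_schema(lines, schema):
    """Apply the schema while reading; the stored data is never rewritten."""
    for line in lines:
        record = json.loads(line)
        yield {column: extract(record, path) for column, path in schema.items()}

rows = list(read_with_schema(RAW_LINES, SCHEMA))
```

Adding a column later, for example `{**SCHEMA, "amount": ("props", "amount")}`, needs no reload of the stored data, which is the contrast with schema on load.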
Not This, That
Not this:
1. SAN/NAS Engineer: Define Storage Architecture
2. Data Warehouse Architect: Design Star Schema
3. DBA: Create Tables
4. ETL Engineer: Write ETL Code
5. DBA: Automate Data Load
6. BI Engineer: Design Cube
7. ETL Engineer: Automate Cube Load
8. BI Engineer: Design Reports/Dashboards
VS that:
1. Hadoop Engineer: Define location to store files
2. Hadoop Engineer: Create EXTERNAL Tables
3. BI Engineer: Run Queries/Create Cubes
Example: Key-Values using Maps
Example: JSON
DEMO MOBA Game Analytics
Demo: DOTA 2
What the User Sees
Key Data Points: 5 vs. 5 players per match. Players choose Heroes, use Items & earn Gold.
FOR THE DATA SCIENTISTS!
Demo: DOTA 2 Raw Data (JSON)
Match Details, Player Details, Player Profile
As Easy As 1, 2, 3
1. Hadoop Engineer: Define location to store files
2. Hadoop Engineer: Create EXTERNAL Tables
3. BI Engineer: Run Queries/Create Cubes
Demo: DOTA 2 Use Case 1
Question: Who are the most popular heroes?
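One way this question could be answered straight from nested match records, sketched in Python. The record layout (a `players` array with a `hero` field) and the sample values are assumptions for illustration only, not the demo's actual data or query.

```python
import json
from collections import Counter

# Hypothetical raw match records, one JSON document per match.
RAW_MATCHES = [
    '{"match_id": 1, "players": [{"hero": "Pudge"}, {"hero": "Lina"}]}',
    '{"match_id": 2, "players": [{"hero": "Pudge"}, {"hero": "Axe"}]}',
    '{"match_id": 3, "players": [{"hero": "Lina"}, {"hero": "Pudge"}]}',
]

def hero_pick_counts(lines):
    """Count hero picks by walking each match's nested players array."""
    counts = Counter()
    for line in lines:
        match = json.loads(line)
        for player in match.get("players", []):
            counts[player["hero"]] += 1
    return counts

# The two most frequently picked heroes in the sample data.
popular = hero_pick_counts(RAW_MATCHES).most_common(2)
```

The players stay nested inside their match, so no separate player table or join is involved.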
Demo: DOTA 2 Use Case 2
Question: Which heroes have the highest win rate?
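A hedged sketch of a win-rate calculation over already-parsed match records. The `winner` and `team` fields, the `min_games` threshold, and the sample data are hypothetical; the real demo data may be shaped differently.

```python
from collections import defaultdict

# Hypothetical parsed match records; field names are assumptions.
MATCHES = [
    {"winner": "radiant", "players": [{"hero": "Pudge", "team": "radiant"},
                                      {"hero": "Lina", "team": "dire"}]},
    {"winner": "dire", "players": [{"hero": "Pudge", "team": "radiant"},
                                   {"hero": "Axe", "team": "dire"}]},
    {"winner": "radiant", "players": [{"hero": "Axe", "team": "radiant"},
                                      {"hero": "Lina", "team": "dire"}]},
    {"winner": "radiant", "players": [{"hero": "Pudge", "team": "radiant"},
                                      {"hero": "Lina", "team": "dire"}]},
]

def hero_win_rates(matches, min_games=1):
    """Return {hero: wins / games} for heroes with at least min_games appearances."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for match in matches:
        for player in match["players"]:
            games[player["hero"]] += 1
            if player["team"] == match["winner"]:
                wins[player["hero"]] += 1
    return {hero: wins[hero] / games[hero]
            for hero in games if games[hero] >= min_games}

rates = hero_win_rates(MATCHES)
# Highest win rate first.
ranked = sorted(rates.items(), key=lambda pair: pair[1], reverse=True)
```

The `min_games` parameter is there because a hero seen in only a handful of matches can top the ranking by chance.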
Demo: DOTA 2 Use Case 3
Question: What are the top 3 items associated with the best win rate?
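The same idea extends one level deeper, to the items nested inside each player. This is a sketch under assumed field names (`items`, `team`, `winner`) with made-up sample data. Ties are broken by number of games and then by item name so the ordering is deterministic.

```python
from collections import defaultdict

# Hypothetical parsed match records; each player carries a nested items list.
MATCHES = [
    {"winner": "radiant", "players": [
        {"team": "radiant", "items": ["Blink Dagger", "Boots"]},
        {"team": "dire", "items": ["Boots", "Wand"]}]},
    {"winner": "dire", "players": [
        {"team": "radiant", "items": ["Wand", "Boots"]},
        {"team": "dire", "items": ["Blink Dagger", "Shield"]}]},
    {"winner": "radiant", "players": [
        {"team": "radiant", "items": ["Shield", "Boots"]},
        {"team": "dire", "items": ["Wand"]}]},
]

def top_items_by_win_rate(matches, n=3):
    """Rank items by the win rate of the players who carried them."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for match in matches:
        for player in match["players"]:
            won = player["team"] == match["winner"]
            # set() counts an item once per player even if it is held twice.
            for item in set(player["items"]):
                games[item] += 1
                wins[item] += int(won)
    ranked = sorted(
        games,
        key=lambda item: (-wins[item] / games[item], -games[item], item),
    )
    return [(item, wins[item] / games[item]) for item in ranked[:n]]

top3 = top_items_by_win_rate(MATCHES)
```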
Practical Applications
- Time Series Analysis (session data)
- Affinity Analysis
- Segmentation Analysis
- Many to Many
NO JOINS = HORIZONTAL SCALE
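A rough illustration of why join-free, self-contained records scale out: each partition can be aggregated on its own and the partial results merged afterwards. The partition layout and field names below are hypothetical, and the list comprehension stands in for work that a cluster would spread across nodes.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions of nested match records, e.g. one list per node.
PARTITIONS = [
    [{"players": [{"hero": "Pudge"}, {"hero": "Lina"}]}],
    [{"players": [{"hero": "Pudge"}, {"hero": "Axe"}]},
     {"players": [{"hero": "Axe"}, {"hero": "Lina"}]}],
]

def aggregate_partition(matches):
    """Count hero picks using only the records in this partition."""
    counts = Counter()
    for match in matches:
        for player in match["players"]:
            counts[player["hero"]] += 1
    return counts

def merge(left, right):
    """Combine two partial results; addition is associative, so order is irrelevant."""
    return left + right

# Each partition is processed independently, then the partials are merged.
partials = [aggregate_partition(partition) for partition in PARTITIONS]
total = reduce(merge, partials, Counter())
```

No partition needs data from another one to do its share of the work, so adding partitions adds capacity.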
FOR THE ORDINARY HUMAN!
Define: Data Modeler
Consume: Business Analysts
DEMO
Summary: The Do's & Don'ts

| Do | Don't |
|---|---|
| Capture data as is | Pre-aggregate data |
| Apply schema on read | Force schema on load |
| Land new data on Hadoop | Land new data on relational DBs |
| Create a data warehouse | Create data marts |
| Leverage open source engines | Invest in proprietary databases |
Business Intelligence Redefined