Data Warehouse 2.0: How Hive & the Emerging Interactive Query Engines Change the Game Forever. David P. Mariani, AtScale, Inc. September 16, 2013




THE TRUTH ABOUT DATA
"We think only 3% of the potentially useful data is tagged, and even less is analyzed." (Source: IDC Predictions 2013: Big Data, IDC)
"90% of the data in the world today has been created in the last two years." (Source: IBM)

The Broken Promise
What we wanted: a centralized data warehouse for Sales, Finance, Marketing, and CRM
What we got: departmental data marts (Sales, Finance, Marketing, CRM)
Couldn't handle: Volume, Velocity & Variety

A new way to manage data
Requirement      Traditional Databases   Hadoop
Capture/Store    Write Many              Write Once        Volume
Model/Map        Structured              Semi-Structured   Variety
Transform/Load   Early                   Late              Velocity

It's time for a new approach
1990s, Relational DBs: File Server (Capture) -> ETL Tool (Extract, Transform, Load) -> Query Engine (Query)
2000s, MPP DBs: File Server (Capture) -> ETL Tool (Extract, Load) -> Query Engine (Transform, Query)
Now, Hadoop + Hive: Hadoop + Hive (Capture, Map, Transform, Query)

Example 1: Klout

Example 1: Klout's Big Data
15 Social Networks Processed Every Day
769 Terabytes of Data Storage
200,000 Indexed Users Added Every Day
400,000,000 Users Indexed Every Day
12,000,000,000 Social Signals Processed Every Day
50,000,000,000 API Calls Delivered Every Month
1,080,000,000,000 Rows of Data In Data Warehouse

Example 1: Klout data architecture
Diagram labels: Serving, UX, Data Pipeline & Factory, Serving Stores, Analytics
Components:
  Registrations DB (MySql)
  Klout.com (Node.js)
  Signal Collectors (Java/Scala)
  Data Enhancement Engine (PIG/Hive)
  Data Warehouse (Hive)
  Profile DB (HBase)
  Search Index (Elastic Search)
  Klout API (Scala)
  Mobile (ObjectiveC)
  Partner API (Mashery)
  Streams (MongoDB)
  Monitoring (Nagios)
  Dashboards (Tableau)
  Analytics Cubes (SSAS)
  Perks Analytics (Scala)
  Event Tracker (Scala)

Example 1: Klout Event Tracker
Pipeline stages: Instrument -> Collect -> Persist -> Query -> Report
Components: Klout UI (AJAX), Tracker API (node.js), Log Process (Flume), Warehouse, Cube (Analysis Services), UX

Example 1: Klout Event Tracker
Tracking call sent to the Tracker API:
  insights3:9003/track/{"project":"plusk","event":"spend","ks_uid":123456,"type":"add_topic"}

Example 1: Klout Event Tracker
The collected event:
  {
    "project":"plusk",
    "event":"spend",
    "session_id":"0",
    "ip":"50.68.47.158",
    "kloutid":"123456",
    "cookie_id":"123456",
    "ref":"http://klout.com/",
    "type":"add_topic",
    "time":"1338366015"
  }
will be saved in HDFS at: /logs/events_tracking/2012-05-30/0100
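The mapping from an event's time field to its HDFS directory can be sketched in Python as follows. This is a minimal illustration, not Klout's actual code: the date/hour directory layout (YYYY-MM-DD/HH00) is read off the single example path on the slide, the fixed UTC-7 offset is an assumption chosen because it reproduces that path, and the name partition_path is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: events are bucketed by wall-clock hour at a fixed UTC-7 offset.
# The slide does not state the time zone; this offset matches its example path.
ASSUMED_TZ = timezone(timedelta(hours=-7))
LOG_ROOT = "/logs/events_tracking"  # root directory shown on the slide


def partition_path(epoch_seconds: int, tz: timezone = ASSUMED_TZ) -> str:
    """Return the date/hour directory for an event with this Unix timestamp."""
    ts = datetime.fromtimestamp(epoch_seconds, tz)
    return f"{LOG_ROOT}/{ts:%Y-%m-%d}/{ts:%H}00"


# The event from the slide carries its time as a string of epoch seconds.
event = {"project": "plusk", "event": "spend", "time": "1338366015"}
print(partition_path(int(event["time"])))
```

Writing each hour to its own directory is what later lets the warehouse table expose the date and hour as partition columns, so queries that filter on them read only the matching directories.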

Example 1: Klout Event Tracker
Warehouse table EVENT_LOG:
  tstamp      INT
  project     STRING
  event       STRING
  session_id  BIGINT
  ks_uid      BIGINT
  ip          STRING
  json_map    MAP<STRING,STRING>
  json_text   STRING
  dt          STRING
  hr          STRING


Example 1: Klout Event Tracker
MDX query against the Analysis Services cube:
  SELECT
    { [Measures].[Counter], [Measures].[PreviousPeriodCounter] } ON COLUMNS,
    NON EMPTY CROSSJOIN (
      exists([date].[date].[date].allmembers,
             [Date].[Date].&[2012-05-19T00:00:00]:[Date].[Date].&[2012-06-02T00:00:00]),
      [Events].[Event].[Event].allmembers
    ) DIMENSION PROPERTIES MEMBER_CAPTION ON ROWS
  FROM [ProductInsight]
  WHERE ({[Projects].[Project].[plusK]})

Example 1: Klout Event Tracker
Hive query against the warehouse:
  SELECT
    get_json_object(json_text,'$.sid') as sid,
    get_json_object(json_text,'$.kloutid') as kloutid,
    get_json_object(json_text,'$.v') as version,
    get_json_object(json_text,'$.status') as status,
    event
  FROM bi.event_log
  WHERE project='mobile ios'
    AND tstamp=20121027
    AND event in ('api_error', 'api_timeout')
  ORDER BY sid;
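The query above pulls fields out of the raw JSON text only when it runs, which is the schema-at-query-time idea in miniature. The Python sketch below imitates that behavior for top-level '$.key' paths. It is an illustration under stated assumptions, not Hive's implementation: get_json_field is a hypothetical helper, and the sample rows are invented for the example.

```python
import json
from typing import Optional


def get_json_field(json_text: str, path: str) -> Optional[str]:
    """Look up a top-level '$.key' path; return None if missing or invalid."""
    if not path.startswith("$."):
        raise ValueError("this sketch only supports top-level '$.key' paths")
    try:
        doc = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict):
        return None
    value = doc.get(path[2:])
    if value is None:
        return None
    # Strings come back as-is; other JSON values are returned as JSON text.
    return value if isinstance(value, str) else json.dumps(value)


# Hypothetical rows: an event name plus the untouched JSON payload.
rows = [
    {"event": "api_timeout", "json_text": '{"sid": "b2", "v": "1.2", "status": 504}'},
    {"event": "spend", "json_text": '{"sid": "a1"}'},
    {"event": "api_error", "json_text": '{"sid": "a9", "status": 500}'},
]

# Filter on the event column, project JSON fields, then order by sid.
result = sorted(
    (
        {
            "sid": get_json_field(r["json_text"], "$.sid"),
            "version": get_json_field(r["json_text"], "$.v"),
            "status": get_json_field(r["json_text"], "$.status"),
            "event": r["event"],
        }
        for r in rows
        if r["event"] in ("api_error", "api_timeout")
    ),
    key=lambda r: r["sid"] or "",
)
for row in result:
    print(row)
```

A field that an older client never sent simply comes back empty, so new attributes can be logged without altering the table first.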

Example 2: Online Gaming Company

Capture
  LogIn\t1369155542\t4533245\t"loc": 23,"rank":"Expert","client":"ios"\lf
  Buy\t1369155556\t4533446\t"loc": 23,"item":"212","ref":"ask.com","amt":"1.50"\lf

Map
  CREATE EXTERNAL TABLE event_log (
    event STRING,
    event_time BIGINT,  -- Unix epoch seconds, as in the captured lines
    user_id INT,
    event_attributes MAP<STRING, STRING>
  )
  -- Intent: partition by day(from_unixtime(event_time)). Hive partitions are
  -- declared columns, not expressions; event_day is an illustrative name.
  PARTITIONED BY (event_day INT)
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
  LOCATION '/user/event_logs';

Transform + Query
  SELECT
    SUBSTR(FROM_UNIXTIME(event_time),1,7) AS MonthOfEvent,
    event_attributes["loc"] AS Location,
    count(*) AS EventCount
  FROM event_log
  WHERE year(from_unixtime(event_time)) = 2013
  GROUP BY SUBSTR(FROM_UNIXTIME(event_time),1,7), event_attributes["loc"];
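The Capture, Map, and Transform + Query steps above can be traced end to end on the two sample lines with a short Python sketch. It is an illustration only: the parsing rules (tab-separated fields, comma-separated key:value attributes, quotes stripped) are assumptions read off the sample lines, months are computed in UTC for simplicity, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

# The two captured lines from the slide (tab-delimited, attributes last).
RAW_LINES = [
    'LogIn\t1369155542\t4533245\t"loc": 23,"rank":"Expert","client":"ios"',
    'Buy\t1369155556\t4533446\t"loc": 23,"item":"212","ref":"ask.com","amt":"1.50"',
]


def parse_attributes(field: str) -> dict:
    """Split 'key:value' items on commas and strip quotes and whitespace."""
    attrs = {}
    for item in field.split(","):
        key, _, value = item.partition(":")
        attrs[key.strip().strip('"')] = value.strip().strip('"')
    return attrs


def parse_line(line: str) -> dict:
    """Map one raw line onto the four columns of the event_log table."""
    event, event_time, user_id, attributes = line.rstrip("\n").split("\t", 3)
    return {
        "event": event,
        "event_time": int(event_time),
        "user_id": int(user_id),
        "event_attributes": parse_attributes(attributes),
    }


def month_of(epoch_seconds: int) -> str:
    """Return 'YYYY-MM' for a Unix timestamp (UTC is an assumption here)."""
    return datetime.fromtimestamp(epoch_seconds, timezone.utc).strftime("%Y-%m")


# Equivalent of the GROUP BY: count events per (month, location) for 2013.
events = [parse_line(line) for line in RAW_LINES]
counts = Counter(
    (month_of(e["event_time"]), e["event_attributes"].get("loc"))
    for e in events
    if month_of(e["event_time"]).startswith("2013")
)
print(dict(counts))
```

The raw lines are never rewritten: the structure is applied each time they are read, so a new attribute such as "amt" becomes queryable without reloading anything.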

Hive began as a batch tool
Diagram labels: Batch, Interactive, Serving Stores, Analytics
Components:
  Registrations DB (MySql)
  Klout.com (Node.js)
  Signal Collectors (Java/Scala)
  Data Enhancement Engine (PIG/Hive)
  Data Warehouse (Hive)
  Profile DB (HBase)
  Search Index (Elastic Search)
  Klout API (Scala)
  Mobile (ObjectiveC)
  Partner API (Mashery)
  Streams (MongoDB)
  Monitoring (Nagios)
  Dashboards (Tableau)
  Analytics Cubes (SSAS)
  Perks Analytics (Scala)
  Event Tracker (Scala)

Hive now has interactive flavors
                                  Shark      Impala       Stinger
Performance approach              Use RAM    Replace MR   Improve Hive
Theoretical limits (# of rows)    Billions   Trillions    Trillions
Supports UDFs, SerDes             Yes        Soon         Yes
Supports non-scalar data types    Yes        Soon         Yes
Preferred file format             Tachyon    Parquet      ORC
Sponsorship                       AMPLab     Cloudera     Hortonworks
Table: Hive-compatible interactive query engines

Hive is an inexpensive MPP database
TPC-H Query Run Times (Impala vs. HANA), lineitem table, 60 million rows
Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoopimpala-on-aws

Select statements:
  Q1: select count(*) from lineitem
  Q2: select count(*), sum(l_extendedprice) from lineitem
  Q3: select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode
  Q4: select l_shipmode, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' group by l_shipmode
  Q5: select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus
  Q6: select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' group by l_shipmode, l_linestatus
  Q7: select count(*) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1
  Q8: select l_shipmode, l_linestatus, l_extendedprice from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1
  Q9: select * from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1

Time (seconds):
                     HANA    Impala Small  Impala Small  Impala Small  Impala Small
          Records    Small   (1 Node)      (3 Nodes)     (1 Node)      (3 Nodes)
  Query   Returned           Parquet       Parquet       Text          Text
  Q1      1          1       3             1             74            31
  Q2      1          4       12            3             73            29
  Q3      7          8       23            5             74            28
  Q4      1          1       20            4             73            28
  Q5      14         10      32            7             74            28
  Q6      1          1       27            5             72            29
  Q7      45         1       23            5             73            30
  Q8      45         1       29            5             73            31
  Q9      45         1       104           21            73            30

Size: HANA 1.9Gb (5 Part.); Parquet 3.2Gb (40 files x 80mb); Text 7.2Gb (1 file, No Compression)
Est. Monthly Cost of Production Environment on AWS (HANA m2.xlarge, Impala m1.medium):
  HANA Small $1022; Impala 1 Node Parquet $175; Impala 3 Nodes Parquet $350; Impala 1 Node Text $175; Impala 3 Nodes Text $350
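To make the price/performance comparison concrete, the short Python sketch below computes two ratios from the table's figures: how much faster the 3-node Parquet runs are than the 1-node Parquet runs, and how the HANA monthly cost compares with the 3-node Impala cost. It assumes the column order read from the slide (records returned, HANA, Parquet 1 node, Parquet 3 nodes, Text 1 node, Text 3 nodes); the variable names are illustrative.

```python
# Run times in seconds for the nine queries, as listed on the slide.
PARQUET_1_NODE = [3, 12, 23, 20, 32, 27, 23, 29, 104]
PARQUET_3_NODES = [1, 3, 5, 4, 7, 5, 5, 5, 21]

# Estimated monthly AWS cost in dollars, as listed on the slide.
MONTHLY_COST = {"HANA": 1022, "Impala 1 node": 175, "Impala 3 nodes": 350}

# Speedup of 3 nodes over 1 node, per query (Parquet only).
speedups = [one / three for one, three in zip(PARQUET_1_NODE, PARQUET_3_NODES)]

# How many times more the HANA environment costs than 3-node Impala.
cost_ratio = MONTHLY_COST["HANA"] / MONTHLY_COST["Impala 3 nodes"]

for number, speedup in enumerate(speedups, start=1):
    print(f"Q{number}: {speedup:.1f}x faster on 3 nodes")
print(f"HANA costs {cost_ratio:.2f}x the 3-node Impala setup")
```

These ratios only restate the slide's numbers; they say nothing about how either system behaves on other data sizes or queries.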

Demonstration: Hive vs. Impala

Summary
Hadoop will disrupt the data warehousing ecosystem
Consider Hadoop/Hive for new applications
Rethink how you capture & store data
  Capture as much as possible, but don't aggregate/normalize it
  Dimensional modeling still applies (but is much less constricting)
  Impose a schema as late as possible (at query time if possible)

Contact Information
If you have further questions or comments:
David P. Mariani, AtScale, Inc.
dave@atscale.com
@dmariani