Real-Time Data Analytics and Visualization




Real-Time Data Analytics and Visualization: Making the Leap to BI on Hadoop
Predictive Analytics & Business Insights 2015, February 9, 2015
David P. Mariani, CEO, AtScale, Inc.

THE TRUTH ABOUT DATA
"We think only 3% of the potentially useful data is tagged, and even less is analyzed." Source: IDC Predictions 2013: Big Data, IDC
"90% of the data in the world today has been created in the last two years." Source: IBM

What We Wanted The Centralized Broken Data Warehouse Promise

What We Got Data Marts

What We Wanted Centralized Data Warehouse

What is Hadoop?
Distributed File System (HDFS)
Designed for commodity hardware
Supports any file format (SerDes)
Linearly scalable, parallel

What is Hive?
SQL-like interface on top of Hadoop
Has become the semantic layer for Hadoop
Originally designed for batch processing
Now has interactive flavors

Hive Now Comes in Several Flavors

Feature                          Spark SQL    Impala      Hive/Tez       Drill
Performance approach             Caching      Optimizer   Improve Hive   Optimizer
Theoretical limits (# of rows)   Billions     Trillions   Trillions      Trillions
Supports UDFs, SerDes            Yes          Soon        Yes            Yes
Supports non-scalar data types   Yes          Soon        Yes            Yes
Preferred file format            Tachyon      Parquet     ORC            Parquet
Sponsorship                      Databricks   Cloudera    Hortonworks    MapR

Hive is a Cheap MPP Database
TPC-H Query Run Times (Impala vs. HANA), lineitem table, 60 million rows. Times are in seconds.

Queries:
Q1: select count(*) from lineitem
Q2: select count(*), sum(l_extendedprice) from lineitem
Q3: select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode
Q4: select l_shipmode, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' group by l_shipmode
Q5: select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus
Q6: select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' group by l_shipmode, l_linestatus
Q7: select count(*) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1
Q8: select l_shipmode, l_linestatus, l_extendedprice from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1
Q9: select * from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1

Query   Records    HANA    Impala Small   Impala Small   Impala Small   Impala Small
        returned   Small   (1 Node)       (3 Nodes)      (1 Node)       (3 Nodes)
                           Parquet        Parquet        Text           Text
Q1      1          1       3              1              74             31
Q2      1          4       12             3              73             29
Q3      7          8       23             5              74             28
Q4      1          1       20             4              73             28
Q5      14         10      32             7              74             28
Q6      1          1       27             5              72             29
Q7      45         1       23             5              73             30
Q8      45         1       29             5              73             31
Q9      45         1       104            21             73             30

Size: HANA 1.9Gb (5 Part.); Parquet 3.2Gb (40 files x 80mb); Text 7.2Gb (1 file, No Compression)
Est. Monthly Cost of Production Environment on AWS (HANA m2.xlarge, Impala m1.medium): HANA $1022; Impala 1 Node Parquet $175; Impala 3 Nodes Parquet $350; Impala 1 Node Text $175; Impala 3 Nodes Text $350

Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws

WHAT WE GOT ETL + STAR SCHEMAS

Traditional Data Architecture
Diagram layers, top to bottom: ANALYSIS TOOLS; QUERY ENGINE; MART, MART, MART; ETL; DATA WAREHOUSE; INPUT DATA

What's Wrong with this Picture?
Diagram layers, top to bottom: ANALYSIS TOOLS; QUERY ENGINE; MART, MART, MART; ETL; DATA WAREHOUSE; INPUT DATA
Highly complex
Lots of people & skillsets
Multiple copies of data
Stale data
Rigid schema
Tough to change
Write Many; Structured Data; Schema on Load

It Takes an Army
SAN/NAS Engineer: Define Storage Architecture
Data Warehouse Architect: Design Star Schema
DBA: Create Tables
ETL Engineer: Write ETL Code
DBA: Automate Data Load
BI Engineer: Design Cube
ETL Engineer: Automate Cube Load
BI Engineer: Design Reports/Dashboards

Star Schema = Unnatural!

WHAT WE WANTED SCHEMA ON DEMAND

The New Way: Eliminate Layers
Traditional Approach: ANALYSIS TOOLS; QUERY ENGINE; MART, MART, MART; ETL; DATA WAREHOUSE; INPUT DATA
New Approach: ANALYSIS TOOLS; HADOOP; INPUT DATA

Map & Transform on Read VS Write Once Nested, Loosely Structured Schema on Read

Not This, That

Not this (traditional approach):
SAN/NAS Engineer: Define Storage Architecture
Data Warehouse Architect: Design Star Schema
DBA: Create Tables
ETL Engineer: Write ETL Code
DBA: Automate Data Load
BI Engineer: Design Cube
ETL Engineer: Automate Cube Load
BI Engineer: Design Reports/Dashboards

VS

That (Hadoop approach):
Hadoop Engineer: Define location to store files
Hadoop Engineer: Create EXTERNAL Tables
BI Engineer: Run Queries/Create Cubes

Example: Key-Values using Maps
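
The slide's own example is not preserved in this transcript. As a rough illustration of the idea only, the Python sketch below keeps a free-form key-value column as raw text and turns it into a map when the row is read. The row layout, the delimiters, and the names `parse_map` and `read_events` are assumptions made for this sketch, not anything shown in the talk.

```python
def parse_map(field, item_sep=",", kv_sep=":"):
    """Turn 'k1:v1,k2:v2' into {'k1': 'v1', 'k2': 'v2'}."""
    result = {}
    if not field:
        return result
    for pair in field.split(item_sep):
        key, _, value = pair.partition(kv_sep)
        result[key.strip()] = value.strip()
    return result


# Hypothetical raw rows: an event id, a tab, then a free-form attribute map.
raw_rows = [
    "1001\tdevice:ios,country:US,plan:free",
    "1002\tdevice:android,country:DE",
    "1003\tdevice:ios,country:US,plan:paid,referrer:email",
]


def read_events(rows):
    """Apply structure at read time; the stored rows stay untouched."""
    for row in rows:
        event_id, attrs = row.split("\t", 1)
        yield {"event_id": int(event_id), "attrs": parse_map(attrs)}


events = list(read_events(raw_rows))

# Rows may carry different keys; a missing key is simply absent from the map.
ios_count = sum(1 for e in events if e["attrs"].get("device") == "ios")
print(ios_count)
```

Because the attributes live in a map, a new key such as `referrer` can appear in later rows without any change to a table definition.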

Example: JSON
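
The JSON example from the slide is also missing from the transcript. The sketch below is one possible Python illustration of schema on read over nested JSON: the raw lines are stored as is, and a small column-to-path mapping is applied only when a query needs it. The sample records, the `schema` mapping, and the helper names are all hypothetical.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
raw_lines = [
    '{"user": {"id": 7, "geo": {"country": "US"}}, "event": "login"}',
    '{"user": {"id": 9}, "event": "purchase", "amount": 4.99}',
]


def get_path(record, path, default=None):
    """Follow a dotted path such as 'user.geo.country' through nested dicts."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# The "schema" is just a mapping from output column to JSON path,
# chosen at query time rather than at load time.
schema = {"user_id": "user.id", "country": "user.geo.country", "amount": "amount"}


def apply_schema(lines, schema):
    for line in lines:
        record = json.loads(line)
        yield {column: get_path(record, path) for column, path in schema.items()}


rows = list(apply_schema(raw_lines, schema))
for row in rows:
    print(row)
```

A different question can use a different `schema` over the same stored lines, which is the point of deferring structure until read time.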

DEMO MOBA Game Analytics

Demo: DOTA 2
What the User Sees
Key Data Points: 5 vs. 5 players per match. Players choose Heroes, use Items & earn Gold.

FOR THE DATA SCIENTISTS!

Demo: DOTA 2 Raw Data (JSON)
Match Details; Player Details; Player Profile

As Easy As 1, 2, 3
1. Hadoop Engineer: Define location to store files
2. Hadoop Engineer: Create EXTERNAL Tables
3. BI Engineer: Run Queries/Create Cubes

Demo: DOTA 2 Use Case 1
Question: Who are the most popular heroes?
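
The demo queries themselves are not in the transcript. As a minimal sketch of how this question could be answered directly over nested match records, the Python below counts hero picks without first flattening players into a separate table. The record layout (`players`, `hero`, `team`, `radiant_win`) and the sample matches are assumptions for illustration, not the actual Dota 2 feed.

```python
import json
from collections import Counter

# Hypothetical raw match records, one JSON document per match.
raw_matches = [
    '{"match_id": 1, "radiant_win": true, "players": ['
    '{"hero": "Pudge", "team": "radiant"}, {"hero": "Invoker", "team": "dire"}]}',
    '{"match_id": 2, "radiant_win": true, "players": ['
    '{"hero": "Sniper", "team": "radiant"}, {"hero": "Pudge", "team": "dire"}]}',
    '{"match_id": 3, "radiant_win": false, "players": ['
    '{"hero": "Pudge", "team": "radiant"}, {"hero": "Invoker", "team": "dire"}]}',
]


def hero_pick_counts(lines):
    """Count how often each hero appears across all matches."""
    counts = Counter()
    for line in lines:
        match = json.loads(line)
        # The players are nested inside the match, so no join is needed.
        counts.update(player["hero"] for player in match["players"])
    return counts


pick_counts = hero_pick_counts(raw_matches)
most_popular = pick_counts.most_common(2)
print(most_popular)
```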

Demo: DOTA 2 Use Case 2
Question: Which heroes have the highest win rate?
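
One way to approach this question, sketched in Python under assumed field names: a player counts as a winner when their team matches the match outcome, and a hero's win rate is wins divided by games played. The record layout and sample data are hypothetical and do not come from the demo.

```python
import json
from collections import Counter

# Hypothetical raw match records, one JSON document per match.
raw_matches = [
    '{"match_id": 1, "radiant_win": true, "players": ['
    '{"hero": "Pudge", "team": "radiant"}, {"hero": "Invoker", "team": "dire"}]}',
    '{"match_id": 2, "radiant_win": true, "players": ['
    '{"hero": "Sniper", "team": "radiant"}, {"hero": "Pudge", "team": "dire"}]}',
    '{"match_id": 3, "radiant_win": false, "players": ['
    '{"hero": "Pudge", "team": "radiant"}, {"hero": "Invoker", "team": "dire"}]}',
]


def hero_win_rates(lines):
    """Return {hero: wins / games} across all matches."""
    games = Counter()
    wins = Counter()
    for line in lines:
        match = json.loads(line)
        for player in match["players"]:
            # A player wins when their side matches the recorded outcome.
            won = (player["team"] == "radiant") == match["radiant_win"]
            games[player["hero"]] += 1
            wins[player["hero"]] += int(won)
    return {hero: wins[hero] / games[hero] for hero in games}


win_rates = hero_win_rates(raw_matches)
ranked = sorted(win_rates.items(), key=lambda pair: pair[1], reverse=True)
for hero, rate in ranked:
    print(f"{hero}: {rate:.2f}")
```

With real data, a minimum number of games per hero would probably be needed before ranking, since a hero picked once and winning once shows a 100% rate.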

Demo: DOTA 2 Use Case 3
Question: What are the top 3 items associated with the best win rate?
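
This is a many-to-many question (players hold several items, items appear with many players), which is where nested records help. The Python sketch below assumes each player record carries an `items` list; that layout, the field names, and the sample matches are illustrative assumptions rather than the demo's actual data.

```python
import json
from collections import defaultdict

# Hypothetical raw match records; each player carries a nested list of items.
raw_matches = [
    '{"match_id": 1, "radiant_win": true, "players": ['
    '{"team": "radiant", "items": ["Blink Dagger", "Boots"]}, '
    '{"team": "dire", "items": ["Boots", "Shadow Blade"]}]}',
    '{"match_id": 2, "radiant_win": true, "players": ['
    '{"team": "radiant", "items": ["Blink Dagger", "Black King Bar"]}, '
    '{"team": "dire", "items": ["Shadow Blade", "Black King Bar"]}]}',
    '{"match_id": 3, "radiant_win": false, "players": ['
    '{"team": "radiant", "items": ["Boots", "Shadow Blade"]}, '
    '{"team": "dire", "items": ["Blink Dagger", "Boots", "Black King Bar"]}]}',
]


def item_win_rates(lines):
    """Return {item: wins / appearances}, counting each item once per player."""
    games = defaultdict(int)
    wins = defaultdict(int)
    for line in lines:
        match = json.loads(line)
        for player in match["players"]:
            won = (player["team"] == "radiant") == match["radiant_win"]
            for item in set(player["items"]):
                games[item] += 1
                wins[item] += int(won)
    return {item: wins[item] / games[item] for item in games}


rates = item_win_rates(raw_matches)
top3 = sorted(rates, key=rates.get, reverse=True)[:3]
print(top3)
```

The item list stays inside the player record, so the aggregation walks each match once instead of joining a match table to a player-items table.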

Practical Applications
Time Series Analysis (session data)
Affinity Analysis
Segmentation Analysis
Many to Many

NO JOINS = HORIZONTAL SCALE

FOR THE ORDINARY HUMAN!

Define: Data Modeler
Consume: Business Analysts

DEMO

Summary: The Do's & Don'ts

Do                             Don't
Capture data as is             Pre-aggregate data
Apply schema on read           Force schema on load
Land new data on Hadoop        Land new data on relational DBs
Create a data warehouse        Create data marts
Leverage open source engines   Invest in proprietary databases

Business Intelligence Redefined