Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Similar documents
Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Native Connectivity to Big Data Sources in MSTR 10

How To Handle Big Data With A Data Scientist

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Tap into Hadoop and Other No SQL Sources

HDP Hadoop From concept to deployment.

The Future of Data Management

Big Data Can Drive the Business and IT to Evolve and Adapt

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Data Integration Checklist

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

The Internet of Things and Big Data: Intro

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

Luncheon Webinar Series May 13, 2013

Big Data Defined Introducing DataStack 3.0

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

HDP Enabling the Modern Data Architecture

Getting Started Practical Input For Your Roadmap

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Big Data Technologies Compared June 2014

Ganzheitliches Datenmanagement

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Information Architecture

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

Advanced In-Database Analytics

Cost-Effective Business Intelligence with Red Hat and Open Source

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

How to make BIG DATA work for you. Faster results with Microsoft SQL Server PDW

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Implement Hadoop jobs to extract business value from large and varied data sets

Achieving Business Value through Big Data Analytics Philip Russom

The Enterprise Data Hub and The Modern Information Architecture

Big Data solutions to support Intelligent Systems and Applications

Sunnie Chung. Cleveland State University

Business Intelligence for Big Data

SQream Technologies Ltd - Confiden7al

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Please give me your feedback

Performance and Scalability Overview

How To Use Big Data For Business

Performance and Scalability Overview

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

The Future of Data Management with Hadoop and the Enterprise Data Hub

This Symposium brought to you by

Can the Elephants Handle the NoSQL Onslaught?

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Making Sense of Big Data in Insurance

I/O Considerations in Big Data Analytics

IBM AND NEXT GENERATION ARCHITECTURE FOR BIG DATA & ANALYTICS!

The Inside Scoop on Hadoop

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Microsoft Analytics Platform System. Solution Brief

Making Sense of the Madness

2015 Ironside Group, Inc. 2

Big Data and Data Science: Behind the Buzz Words

Determine the Right Analytic Database: A Survey of New Data Technologies

EMC/Greenplum Driving the Future of Data Warehousing and Analytics

Il mondo dei DB Cambia : Tecnologie e opportunita`

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Advanced Big Data Analytics with R and Hadoop

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

#TalendSandbox for Big Data

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data. Value, use cases and architectures. Petar Torre Lead Architect Service Provider Group. Dubrovnik, Croatia, South East Europe May, 2013

Oracle Big Data SQL Technical Update

An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise

Real Time Big Data Processing

Talend Big Data. Delivering instant value from all your data. Talend

Artur Borycki. Director International Solutions Marketing

Introducing Oracle Exalytics In-Memory Machine

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Step by Step: Big Data Technology. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 25 August 2015

Yu Xu Pekka Kostamaa Like Gao. Presented By: Sushma Ajjampur Jagadeesh

Decoding the Big Data Deluge a Virtual Approach. Dan Luongo, Global Lead, Field Solution Engineering Data Virtualization Business Unit, Cisco

Stinger Initiative: Introduction

Big Data: What You Should Know. Mark Child Research Manager - Software IDC CEMA

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

NoSQL for SQL Professionals William McKnight

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Cleveland State University

Comprehensive Analytics on the Hortonworks Data Platform

Traditional BI vs. Business Data Lake A comparison

Big Data and Its Impact on the Data Warehousing Architecture

Database Performance with In-Memory Solutions

Testing Big data is one of the biggest

Enterprise Operational SQL on Hadoop Trafodion Overview

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Apache Hadoop in the Enterprise. Dr. Amr Awadallah,

Modern Data Architecture for Predictive Analytics

Bringing the Power of SAS to Hadoop

Big Data Success Step 1: Get the Technology Right

Transcription:

Hadoop and Relational base The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

The Evolution of Analytics Mainframe EDW Proprietary MPP Unix SMP MPP Appliance Hadoop? Questions Is Hadoop a replacement or compliment for R-DBMS today? How does this change in the future? Is the current vendor co-processor approach the right model? If so, how do these technologies best work together? 2

What we see today Vendors offering MPP co-processors Vertica, Teradata, Greenplum, Netezza, etc + MapReduce Focus on good integration with Hadoop Parallel Movement Native access to each other s data Metadata Integration 3

What do you mean when you say Analytics Query Predictive Modeling OLAP Clustering/Segmentation EIS Reporting Warehouse Optimization Analytic Sandbox Affinity Analysis 4

Analytics and the Analytic Lifecycle Determine business objectives Understand business requirements Determine data mining goals Deployment Publish a report Deliver a score file Implement a model within an event processor Understand Bus. Understand Collect initial data Describe data Explore data Verify data quality Preparation Select data Clean data Construct data Integrate data Format data Modeling Evaluation Evaluate results Review process Determine next steps Select modeling technique Generate test design Build model Assess model 5

Testing classic BI queries on Hive Tested in HP s Performance Labs Working with Distro vendors Comparable clusters Suite of 22 TPC-H queries Not based on tuned, audited results 1600 1400 1200 1000 800 600 400 Geomean Elapsed Time(sec) 200 0 RDBMS1 RDBMS2 Hive 6

Digging a little deeper 12000 10000 8000 6000 RDBMS1 RDBMS2 Hive 4000 2000 0 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 7

Why the performance difference? Startup Time Example Q2 Very short runner 121 seconds on Hive even with 2GB of data <10 seconds on commercial DBMS s with 1TB of data Queries are turned into MapReduce jobs Runtime Performance Tuning Example Q1 Simple table scan MPP Rowstore DBMS is 74x faster Commercial DBMS s are highly optimized for scans Query Optimization Example Q7 Nested loop join and 4 hash joins MPP Rowstore DBMS is 64x faster Lack of options and lack of stats makes optimization tricky 8

30 Years of Performance Optimization Hadoop Hive Dynamic Programming Optimizer Generated MapReduce Java code Scan Optimizer Runtime Access Methods Commercial DBMS Branch and bound, cost based, histogram stats, Dynamic Recompile, etc Compiled runtime, generated bytecode, optimized psuedocode, chip optimized, etc B trees, columnar scans, zonemaps, bitmap indices, binary searches, etc HDFS semi-structured files Storage Binary, Byte Aligned format 9

Analytics in base MPP DBMS s are adding the ability to execute analytic code within their runtimes Parallel UDF Model Executes inline with the query and close to the data SQL Runtime can parallelize the functions Different than a co-processor model Co-processor SQL and Hadoop analytics side by side Analytics in DB Analytic functions parallelized inside of SQL 10

Testing Text Analytics In base set about 100K Wikipedia articles Four Text queries Q1 requires a machine learning algorithm (CRFs) written in a 3GL Q2-4 can be expressed as SQL with text operators Vertica Hadoop Q1 447 seconds (SQL/UDF) 447 seconds (MapReduce) Q2 0.356 seconds (SQL) 127 seconds (via PigLatin) Q3 0.541 seconds (SQL) 172 seconds (via PigLatin) Q4 24 seconds (SQL) 279 seconds (via PigLatin) 11

Comparing the Strengths Hadoop Schemaless Model Human query optimization The ability to create complex dataflows with multiple inputs and outputs Parallelize many Analytic Functions Adept at unstructured classification and feature extraction MPP DBMS Performance Machine query optimization Mature workload management High concurrency interactive query processing 12

A Framework for Discussing Real Time Batch Query/Analysis Traditional Apps Manipulation Business Intelligence Unstructured Apps App Server/ Web Server File System RDBMS ETL Unstructured Feature Extracton/Classification Staging Area Marts EDW Query, Reporting and ROLAP External Events and (eg. twitter feeds, sensor data, clickstream) Event Processing / Web Apps CEP/Stream Processing NoSQL/ NewSQL Profiles/Models/Scoring Analytic Sandbox Analytic R-DBMS File System Analytic Tools/ Prep

A Framework for Discussing Real Time Batch Query/Analysis Traditional Apps Manipulation Business Intelligence External Events and (eg. twitter feeds, sensor data, clickstream) Traditional Unstructured Apps App Server/ Web Server File System (RDBMS/File System) RDBMS Event Processing / Web Apps NoSQL CEP/Stream Processing NoSQL/ NewSQL (KVS/Document) ETL Unstructured Feature Extracton/Classification Hadoop Profiles/Models/Scoring Staging Area Analytic Sandbox Analytic R-DBMS File System Marts EDW Analytic Tools/ Prep Query, Reporting and ROLAP MPP RDBMS

Leveraging a Co-Processor Approach Understand Bus. Determine business objectives Understand business requirements Determine data mining goals Understand Collect initial data Describe data Explore data Verify data quality DBMS Deployment Publish a report Deliver a score file Implement a model within an event processor HBase Cassandra Mongo Membase etc. Evaluation Evaluate results Review process Determine next steps Preparation Select data Clean data Construct data Integrate data Format data Modeling Select modeling technique Generate test design Build model Assess model Hadoop or Analytic DBMS Hadoop or Analytic DBMS Key - with the right architecture it is cheaper to move the data once 15

The rules of my industry 1. The enemy of great is good enough 16

The rules of my industry 1. The enemy of great is good enough 2. And the enemy of good enough is free 17

How might this change in the future Query Optimization Improvements in Hive Statistics, better join ordering, more join types, etc Startup Time Improvements through Drill Static Runtime like HBase rather than Mapreduce Nested Model means simpler query plans to pass out Runtime Performance Improvements through Drill Nested model eliminates many joins and flattens query trees Columnar, compressed storage optimizes access methods Hive/Drill could evolve to be an interesting analytic co-processor 18

19