
Bringing Intergalactic Data Speak (a.k.a.: SQL) to Hadoop. Martin Willcox [@willcoxmnk], Director, Big Data Centre of Excellence (Teradata International), 4th June 2015

Agenda:
- A (very!) short history of Teradata
- The new Big Data and the emergence of the Logical Data Warehouse
- Hadoop and the Data Lake
- Intergalactic Data Speak to the rescue
- Conclusions and final thoughts

A (very!) short history of Teradata: Big Data before there was Big Data. In 1979, four academics and software engineers quit their day jobs, maxed out their credit cards and built the world's first MPP scale-out Relational Database Computer in a garage in California.

A (very!) short history of Teradata: in 1986, Teradata ships the first commercial 100-node MPP system.

The new Big Data: from transactions and events to interactions and observations. Simple computing devices are now so inexpensive that increasingly everything is instrumented. Instead of capturing transactions and events in the Data Warehouse and inferring behaviour, we can increasingly measure it directly. Organisations making the transactions-to-interactions journey need to address five key challenges.

The new Big Data: the big five challenges of making the transactions-to-interactions journey.
#1: The requirement to manage multi-structured data, and data whose structure changes continuously, means that there is no single Information Management strategy that works equally well across the entire Big Data space.
#2: Understanding interactions requires path / graph / time-series analytics in addition to traditional set-based analytics, so there isn't a single parallel processing framework or technology that works equally well across the entire Big Data space.
#3: The economic challenge of capturing, storing, managing and exploiting Big Data sets that may be large; getting larger quickly; noisy; of (as yet) unproven value; and infrequently accessed.
#4: There might be a needle in one of these haystacks, but if it takes 6-12 months and $1M just to go look, I'll never know.
#5: Getting past "so what" to drive real business value (because old business process + expensive new technology = expensive, old business process).

The Logical Data Warehouse is the industry's adaptation to Big Data. How will you deploy? How many and which platforms will you need? How will you integrate them? And which data need to be centralised and integrated? The Enterprise Data Warehouse era ("give me integrated, high quality data") is giving way to the Logical Data Warehouse (a.k.a. Unified Data Architecture) era, driven by: (1) multi-structured data; (2) interaction / observation analytics; (3) flat or falling IT budgets alongside exploding data volumes; (4) agile exploration and discovery; and (5) operationalisation. The new guidance: centralise and integrate the data that are widely reused and shared, but integrate all of the analytics.

Hadoop and the Data Lake.
Big Idea #1: store all data (whatever "all" means).
Big Idea #2: un-washed, raw data (NoETL / late-binding).
Big Idea #3: leverage multiple technologies to support processing flexibility.
Big Idea #4: resolve the nagging problem of accessibility and data integration.
(Data Warehouse professionals can be excused a certain sense of déjà vu where #4 is concerned!)

The Data Lake will be ubiquitous, but... Working in the Hadoop ecosystem is the province of uniquely trained engineers, people Maguire calls "unicorns". Companies may have talented data teams, he says, but they should expect to supplement and rebuild their teams to make Hadoop successful. "The talent gap is huge," says Maguire. "What you need is somebody who knows 15 different technologies." That drives up TCO. (Walter Maguire, Chief Technologist, HP Big Data, quoted in a blog post on http://www8.hp.com/)

Intergalactic Data Speak* to the rescue! (*With apologies to Rick van der Lans and Chris Date, respectively.) It's messy and imperfect; there are (already) many different dialects; most implementations are a superset of a subset of the standard. But it's also the data lingua franca, and it is declarative rather than imperative / procedural.

SQL-based Query Processing on Hadoop comes in four architectural patterns:
1. RDBMS On Top Of Hadoop
2. Query Engine Using HDFS Files
3. RDBMS Orchestrating Queries With Remote Access to Hadoop/Hive
4. Virtualization Layer Over All Data Sources

Query Processing on Hadoop: RDBMS On Top of Hadoop.
- RDBMS on the Hadoop cluster
- Proprietary data dictionary / metadata
- Proprietary data format within HDFS files
- Data types may be limited
- SQL query engine; SQL language, but standards compatibility varies
- Query engine maturity varies
- Data not portable: cannot be read by other systems / engines
Example: Pivotal HAWQ.
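To make that concrete, here is a minimal sketch of what this pattern looks like from the user's side, written in HAWQ-style (Greenplum-derived) SQL; the table and columns are hypothetical. The SQL is ordinary, but the rows are written in the engine's own storage format inside HDFS and are not directly readable by other Hadoop engines.

    -- Hedged sketch: HAWQ-style DDL (hypothetical table and columns).
    -- Ordinary SQL, but the data lands in the engine's own format within HDFS,
    -- so engines such as Hive or Impala cannot read it directly.
    CREATE TABLE transaction_hist (
        trans_id     BIGINT,
        trans_amount DECIMAL(18,2)
    )
    DISTRIBUTED BY (trans_id);

    SELECT trans_id, trans_amount
    FROM   transaction_hist
    WHERE  trans_amount > 5000;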

Query Processing on Hadoop: Query Engine Using HDFS Files.
- SQL query engine on the Hadoop cluster
- Standard data dictionary / metadata (e.g. Hive)
- Standard data format within HDFS files (e.g. ORC files)
- Data types may be limited
- SQL query engine; SQL language, but standards compatibility varies
- Query engine maturity varies
- Data portable: can be read by other systems / engines
Examples: IBM Big SQL, Cloudera Impala.
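A minimal sketch of this pattern, assuming a shared Hive metastore and a hypothetical table name and HDFS path: the table definition lives in the metastore and the data stays in open ORC files, so other engines that share the metastore (Impala, Big SQL and others) can query the same data without copying it.

    -- Hedged sketch: Hive DDL (table name and HDFS path are hypothetical).
    -- The definition is registered in the shared Hive metastore and the data
    -- remains in open ORC files, readable by other engines on the cluster.
    CREATE EXTERNAL TABLE transaction_hist (
        trans_id     BIGINT,
        trans_amount DECIMAL(18,2)
    )
    STORED AS ORC
    LOCATION '/data/transaction_hist';

    SELECT trans_id, trans_amount
    FROM   transaction_hist
    WHERE  trans_amount > 5000;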

Query Processing on Hadoop: RDBMS Orchestrating Queries With Remote Access to Hadoop/Hive.
- External RDBMS sends (part of) queries to an engine on Hadoop
- Standard data dictionary / metadata within the Hadoop cluster (e.g. Hive)
- Standard data format within HDFS files (e.g. ORC)
- Data types may be limited by the engine on Hadoop and the external RDBMS
- SQL query engine capabilities are a combination of the external and internal Hadoop engines
- Combines data and analytics in two systems
- SQL language; standards compatibility generally high
- Query engine generally mature
- Data in Hadoop portable: can be read by other systems / engines
Example: Teradata QueryGrid (see the QueryGrid use case below).

Query Processing on Hadoop: Virtualization Layer Over All Data Sources.
- External virtualization software sends (part of) queries to an engine on Hadoop
- Standard data dictionary / metadata within the Hadoop cluster (e.g. Hive)
- Standard data format within HDFS files (e.g. ORC)
- Data types may be limited by the engine on Hadoop and the external virtualization software
- SQL query engine capabilities are a combination of the external and Hadoop engines, plus the virtualization layer's limitations
- Combines data and analytics in two systems
- Extra layer and/or data movement
- SQL language; standards compatibility generally high
- Query engine maturity and utilization of engines varies
- Data in Hadoop portable: can be read by other engines
Example: Cisco Data Virtualization Platform (formerly Composite Software).
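As a rough illustration only (generic SQL; the schemas, tables and view name are hypothetical, and the real syntax varies by virtualization product), the layer typically exposes a single virtual view spanning the RDBMS and Hadoop, and decides how much of each query to push down to each source:

    -- Hedged sketch: a federated view in generic SQL (all names hypothetical;
    -- actual syntax depends on the virtualization product).
    -- The view spans an RDBMS table and a Hive table; the virtualization
    -- engine pushes filters down to each source where it can.
    CREATE VIEW all_transactions AS
        SELECT trans_id, trans_amount FROM edw.transactions        -- RDBMS source
        UNION ALL
        SELECT trans_id, trans_amount FROM hive.transaction_hist;  -- Hadoop source

    SELECT COUNT(*)
    FROM   all_transactions
    WHERE  trans_amount > 5000;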

Teradata QueryGrid: optimize, simplify, and orchestrate processing across and beyond the Teradata UDA.
- Run the right analytic on the right platform
- Take advantage of specialized processing engines while operating as a cohesive analytic environment
- Integrated processing, within and outside the UDA
- Easy access to data and analytics through existing SQL skills and tools
- Automated and optimized work distribution through push-down processing across platforms
- Minimize data movement: process data where it resides
- Minimize data duplication
- Transparently automate analytic processing and data movement between systems
- Bi-directional data movement

QueryGrid Teradata 15.00 use case: the Teradata Database holds years 1-5 of transactions, while the deep history (years 5-10) resides in Hadoop.

    SELECT Trans.Trans_ID, Trans.Trans_Amount
    FROM   TD_Transactions Trans
    WHERE  Trans_Amount > 5000
    UNION
    SELECT *
    FROM   FOREIGN TABLE (SELECT Trans_ID, Trans_Amount
                          FROM   Transaction_Hist
                          WHERE  Trans_Amount > 5000)@Hadoop Hist;

The "foreign table" SELECT is pushed to Hive to execute the query. This imports into Teradata only the required columns, and allows predicate processing of conditions on non-partitioned columns. The Hadoop cluster's resources are used for data qualification.

Adaptive Optimizer: better query plans for foreign queries and sub-queries. Incremental planning and execution of smaller query fragments; the most efficient overall query plan is derived from reliable statistics.
- Statistics dynamically collected from foreign data
- Incremental query plans generated for single- and multi-system queries
- Consistent Optimizer approach for queries within and between systems
- Teradata systems transfer query plans between systems
- A fully automatic optimizer feature; users don't have to change anything
Why? Unreliable statistics can result in less-than-optimal query plans; some analytic systems, like Hadoop, don't keep data statistics; and statistics are not designed for compatibility between databases.
How? The optimizer pulls out remote server requests and single-row and scalar non-correlated subqueries from a main query, plans and executes them, plugs the results into the main query, and then plans and executes the main query.
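To illustrate the "how", here is a hedged sketch of the kind of query that benefits, reusing the FOREIGN TABLE syntax from the earlier QueryGrid use case (the table and column names are hypothetical). The scalar, non-correlated subquery against the Hadoop side can be pulled out, planned and executed first; its result, and the statistics gathered while running it, then inform the plan for the main query.

    -- Hedged sketch (hypothetical names; FOREIGN TABLE syntax as in the
    -- earlier QueryGrid example). The scalar subquery against Hadoop is
    -- planned and executed first; its result is plugged into the main query
    -- before that query is itself planned and executed.
    SELECT c.Cust_ID, c.Segment
    FROM   Customers c
    WHERE  c.Lifetime_Value >
           (SELECT AVG(Trans_Amount)
            FROM   FOREIGN TABLE (SELECT Trans_Amount
                                  FROM   Transaction_Hist)@Hadoop Hist);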

Summary & conclusions

Analysts agree that the Logical Data Warehouse is the future of enterprise analytical architecture, even if they can't agree what to call it: Gartner speaks of the "Logical Data Warehouse", Forrester of the "Enterprise Data Hub". "We will abandon the old models based on the desire to implement for high-value analytic applications. Raw data in an affordable distributed data hub. Firms that get this concept realise all data does not need first-class seating."

There are (already) 12+ different SQL interfaces for Hadoop (source: Gartner Market Guide for Hadoop Distributions, 6th January 2015): Apache Drill, Apache Phoenix, Apache Tajo, IBM BigSQL, Pivotal HAWQ, Splice Machine, Teradata QueryGrid, Apache Hive, Apache Spark SQL, Cloudera Impala, Oracle Big Data SQL, Presto, SQLstream. There is broad industry consensus that SQL is a key enabler in making the Hadoop ecosystem accessible to mere mortals. The different technologies have very different strengths and weaknesses, and you may struggle to standardise on only one of them, but...

at least right now, the sweet-spot is in the middle of the spectrum. Of the four patterns: RDBMS On Top Of Hadoop is not open enough; Query Engine Using HDFS Files and RDBMS Orchestrating Queries With Remote Access to Hadoop/Hive are both sound architectural choices, depending on use-case; and a Virtualization Layer Over All Data Sources is not fast / scalable enough.

Final thoughts. What makes Hadoop special is all the things that it can do that parallel RDBMS technologies cannot. The industry focus on SQL interfaces is a rational way of addressing accessibility / TCO issues, but the risk is that we re-invent (lowest-common-denominator) parallel RDBMS technologies. Your goal should not be to try and recreate your IDW on Hadoop (you will likely fail), but to build a Data Lake to capture new data and support new processing...

so start with a business goal, not with a technology:
- Web / clickstream: Who navigates to the website, what do they do in each session, and then afterwards within other channels?
- Voice / text: Who is complaining to the call center, and about what?
- E-mail / Graph: Which brokers are colluding to rig markets, and with whom?
- Sentiment: What are customers saying about the company / products / services on social media sites?
- Process / Path Analytics: What's the optimal process for claims or collections activity?
