Bringing Intergalactic Data Speak (a.k.a. SQL) to Hadoop
Martin Willcox [@willcoxmnk], Director, Big Data Centre of Excellence (Teradata International)
4th June 2015
Agenda
- A (very!) short history of Teradata
- The new Big Data and the emergence of the Logical Data Warehouse
- Hadoop and the Data Lake
- Intergalactic Data Speak to the rescue
- Conclusions and final thoughts
A (very!) short history of Teradata: Big Data before there was Big Data
In 1979, four academics and software engineers quit their day jobs, maxed out their credit cards and built the world's first MPP scale-out Relational Database Computer in a garage in California.
A (very!) short history of Teradata
1986: Teradata ships the first commercial 100-node MPP system.
The new Big Data: from transactions and events to interactions and observations
- Simple computing devices are now so inexpensive that increasingly everything is instrumented.
- Instead of capturing transactions and events in the Data Warehouse and inferring behaviour, we can increasingly measure it directly.
- Organisations making the transactions-to-interactions journey need to address five key challenges.
The new Big Data: the big five challenges of the transactions-to-interactions journey
#1: The requirement to manage multi-structured data, and data whose structure changes continuously, means that there is no single Information Management strategy that works equally well across the entire Big Data space.
#2: Understanding interactions requires path / graph / time-series analytics in addition to traditional set-based analytics, so there isn't a single parallel processing framework or technology that works equally well across the entire Big Data space.
#3: The economic challenge of capturing, storing, managing and exploiting Big Data sets that may be large; getting larger quickly; noisy; of (as yet) unproven value; and infrequently accessed.
#4: There might be a needle in one of these haystacks - but if it takes 6-12 months and $1M just to go look, I'll never know.
#5: Getting past "so what?" to drive real business value (because old business process + expensive new technology = expensive, old business process).
The Logical Data Warehouse is the industry's adaptation to Big Data
How will you deploy? How many, and which, platforms will you need? How will you integrate them? And which data need to be centralised and integrated?
- The Enterprise Data Warehouse era: "Give me integrated, high-quality data."
- The Logical Data Warehouse (a.k.a. Unified Data Architecture) era: "Centralise and integrate the data that are widely reused and shared, but integrate all of the analytics."
Drivers of the shift: (1) multi-structured data; (2) interaction / observation analytics; (3) flat / falling IT budgets and exploding data volumes; (4) agile exploration and discovery; (5) operationalisation.
Hadoop and the Data Lake
Big Idea #1: store all data (whatever "all" means).
Big Idea #2: unwashed, raw data (NoETL / late binding).
Big Idea #3: leverage multiple technologies to support processing flexibility.
Big Idea #4: resolve the nagging problem of accessibility and data integration.
(Data Warehouse professionals can be excused a certain sense of déjà vu where #4 is concerned!)
The Data Lake will be ubiquitous, but...
"Working in the Hadoop ecosystem is the province of uniquely trained engineers, people Maguire calls 'unicorns'. Companies may have talented data teams, he says, but they should expect to supplement and rebuild their teams to make Hadoop successful. 'The talent gap is huge,' says Maguire. 'What you need is somebody who knows 15 different technologies... That drives up TCO.'"
- Walter Maguire, Chief Technologist, HP Big Data, quoted in a blog post on http://www8.hp.com/
Intergalactic Data Speak* to the rescue!
* With apologies to Rick van der Lans and Chris Date, respectively.
SQL is messy and imperfect; there are (already) many different dialects; and most implementations are a superset of a subset of the standard. But it's also the data lingua franca, and declarative, rather than imperative / procedural.
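To make the declarative point concrete, here is a minimal sketch (the table and column names are hypothetical): the statement says only what result is wanted, and each engine - Teradata, Hive, Impala or any other dialect - decides for itself how to scan, aggregate and parallelise.

    -- Minimal, hypothetical example: declarative SQL states the "what",
    -- not the "how". The same statement can run on an MPP RDBMS or a
    -- SQL-on-Hadoop engine; each builds its own execution plan.
    SELECT customer_id,
           SUM(trans_amount) AS total_spend
    FROM   transactions                    -- hypothetical table
    WHERE  trans_date >= DATE '2015-01-01'
    GROUP  BY customer_id
    ORDER  BY total_spend DESC;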
SQL-based Query Processing on Hadoop: four architectural approaches
1. RDBMS on top of Hadoop
2. Query engine using HDFS files
3. RDBMS orchestrating queries, with remote access to Hadoop/Hive
4. Virtualization layer over all data sources
Query Processing on Hadoop: RDBMS On Top of Hadoop
- RDBMS runs on the Hadoop cluster.
- Proprietary data dictionary / metadata; proprietary data format within HDFS files.
- Data types may be limited.
- SQL query engine: SQL language, but standards compatibility varies; query engine maturity varies.
- Data are not portable and cannot be read by other systems / engines.
- Example: Pivotal HAWQ
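As a rough illustration of the trade-off, consider a Greenplum/HAWQ-style table (a sketch only; the table and columns are hypothetical): it is created with ordinary SQL DDL, but the rows land in the engine's own storage layout inside HDFS, so only that engine can read them back.

    -- Sketch of Greenplum/HAWQ-style DDL (hypothetical table).
    -- The table is fully queryable with SQL from the RDBMS layer, but
    -- the underlying HDFS files use a proprietary layout that other
    -- engines cannot read directly.
    CREATE TABLE web_clicks (
        session_id BIGINT,
        page_url   VARCHAR(2048),
        click_ts   TIMESTAMP
    )
    DISTRIBUTED BY (session_id);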
Query Processing on Hadoop: Query Engine Using HDFS Files
- SQL query engine runs on the Hadoop cluster.
- Standard data dictionary / metadata (e.g., Hive); standard data format within HDFS files (e.g., ORC files).
- Data types may be limited.
- SQL language, but standards compatibility varies; query engine maturity varies.
- Data are portable and can be read by other systems / engines.
- Examples: IBM Big SQL, Cloudera Impala
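The portability point is easiest to see in Hive DDL. A sketch, with hypothetical names: the table definition lives in the shared Hive metastore, and the data are plain ORC files at the given HDFS path, so Big SQL, Impala and other engines can read the same files.

    -- Sketch of a Hive external table over ORC files (hypothetical names).
    -- The schema is registered in the shared Hive metastore, and the ORC
    -- files under the LOCATION path remain readable by other SQL engines.
    CREATE EXTERNAL TABLE transaction_hist (
        trans_id     BIGINT,
        trans_amount DECIMAL(18,2),
        trans_date   DATE
    )
    STORED AS ORC
    LOCATION '/data/warehouse/transaction_hist';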
Query Processing on Hadoop: RDBMS Orchestrating Queries With Remote Access to Hadoop/Hive
- External RDBMS sends (part of) each query to an engine on the Hadoop cluster.
- Standard data dictionary / metadata within the Hadoop cluster (e.g., Hive); standard data format within HDFS files (e.g., ORC).
- Data types may be limited by the engine on Hadoop and by the external RDBMS.
- SQL query engine capabilities are a combination of the external and internal Hadoop engines; combines data and analytics in two systems.
- SQL language, with generally high standards compatibility; query engine generally mature.
- Data in Hadoop are portable and can be read by other systems / engines.
- Example: Teradata QueryGrid (see the worked query in the QueryGrid use case below)
Query Processing on Hadoop: Virtualization Layer Over All Data Sources
- External virtualization software sends (part of) each query to an engine on the Hadoop cluster.
- Standard data dictionary / metadata within the Hadoop cluster (e.g., Hive); standard data format within HDFS files (e.g., ORC).
- Data types may be limited by the engine on Hadoop and by the external virtualization software.
- SQL query engine capabilities are a combination of the external and Hadoop engines, plus the limitations of the virtualization layer; combines data and analytics in two systems, with an extra layer and/or data movement.
- SQL language, with generally high standards compatibility; query engine maturity and utilisation of engines vary.
- Data in Hadoop are portable and can be read by other engines.
- Example: Cisco Data Virtualization Platform (formerly Composite Software)
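As an illustrative sketch only (generic federation-style SQL, not the actual syntax of Cisco's product; all names are hypothetical), a virtualization layer lets one statement join a table surfaced from an RDBMS with one surfaced from Hive, pushing what it can down to each source before combining the results itself.

    -- Illustrative federation sketch (hypothetical names; not the syntax
    -- of any specific product). The virtualization layer exposes both
    -- sources as schemas and pushes filters down to each before joining.
    SELECT c.customer_id,
           c.segment,
           SUM(h.trans_amount) AS hist_spend
    FROM   rdbms_src.customers         c   -- surfaced from the RDBMS
    JOIN   hadoop_src.transaction_hist h   -- surfaced from Hive/ORC
           ON h.customer_id = c.customer_id
    WHERE  h.trans_date >= DATE '2010-01-01'
    GROUP  BY c.customer_id, c.segment;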
Teradata QueryGrid: optimize, simplify, and orchestrate processing across and beyond the Teradata UDA
- Run the right analytic on the right platform: take advantage of specialized processing engines while operating as a cohesive analytic environment; integrated processing within and outside the UDA.
- Easy access to data and analytics through existing SQL skills and tools: automated and optimized work distribution through push-down processing across platforms.
- Minimize data movement: process data where it resides; minimize data duplication; transparently automate analytic processing and data movement between systems; bi-directional data movement.
Teradata 15.00 QueryGrid use case
(Diagram: recent transactions, years 1-5, held in the Teradata Database; deep history, years 5-10, held in Hadoop.)

    SELECT Trans.Trans_ID, Trans.Trans_Amount
    FROM   TD_Transactions Trans
    WHERE  Trans_Amount > 5000
    UNION
    SELECT *
    FROM   FOREIGN TABLE (SELECT Trans_ID, Trans_Amount
                          FROM Transaction_Hist
                          WHERE Trans_Amount > 5000)@Hadoop Hist;

- Pushes the "foreign table" SELECT to Hive to execute the query.
- Imports to Teradata only the required columns.
- Allows predicate processing of conditions on non-partitioned columns.
- The Hadoop cluster's resources are used for data qualification.
Adaptive Optimizer: incremental planning and execution of smaller query fragments
- Most efficient overall query plan derived from reliable statistics: statistics are dynamically collected from foreign data, and incremental query plans are generated for single- and multi-system queries.
- Consistent Optimizer approach for queries within and between systems: Teradata systems transfer query plans between systems.
- A fully automatic optimizer feature: users don't have to change anything.
Why? Unreliable statistics can result in less-than-optimal query plans; some analytic systems, like Hadoop, don't keep data statistics; and statistics are not designed for compatibility between databases.
How? The optimizer pulls remote-server requests and single-row, scalar, non-correlated subqueries out of the main query; plans and executes them; plugs the results into the main query; then plans and executes the main query.
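A sketch of the kind of statement this helps (hypothetical names, reusing the QueryGrid-style foreign-table syntax from the use case above): the scalar, non-correlated subquery over the foreign Hadoop table can be planned and executed first, and its single result plugged into the main query as a constant before the main plan is built.

    -- Sketch (hypothetical names). The non-correlated scalar subquery
    -- over the foreign table can be executed first; its one-row result
    -- is then substituted into the main query as a literal, so the
    -- final plan is built with a known value instead of a guess.
    SELECT Trans_ID, Trans_Amount
    FROM   TD_Transactions
    WHERE  Trans_Amount >
           (SELECT AVG(Hist.Trans_Amount)
            FROM   FOREIGN TABLE (SELECT Trans_Amount
                                  FROM Transaction_Hist)@Hadoop Hist);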
Summary & conclusions
Analysts agree that the Logical Data Warehouse is the future of Enterprise Analytical Architecture, even if they can't agree what to call it: Gartner says "Logical Data Warehouse", Forrester says "Enterprise Data Hub".
"We will abandon the old models based on the desire to implement for high-value analytic applications... Raw data in an affordable distributed data hub... Firms that get this concept realise all data does not need first-class seating."
There are (already) 12+ different SQL interfaces for Hadoop (source: Gartner Market Guide for Hadoop Distributions, 6th January 2015):
Apache Drill, Apache Hive, Apache Phoenix, Apache Spark SQL, Apache Tajo, Cloudera Impala, IBM BigSQL, Oracle Big Data SQL, Pivotal Hawq, Presto, Splice Machine, SQLstream, Teradata QueryGrid.
There is broad industry consensus that SQL is a key enabler in making the Hadoop ecosystem accessible to mere mortals. The different technologies have very different strengths and weaknesses, and you may struggle to standardise on only one of them, but...
...at least right now, the sweet spot is in the middle of the spectrum.
- RDBMS On Top Of Hadoop: not open enough.
- Query Engine Using HDFS Files: a sound architectural choice, depending on use case.
- RDBMS Orchestrating Queries With Remote Access to Hadoop/Hive: a sound architectural choice, depending on use case.
- Virtualization Layer Over All Data Sources: not fast / scalable enough.
Final thoughts
- What makes Hadoop special is all the things that it can do that parallel RDBMS technologies cannot.
- The industry focus on SQL interfaces is a rational way of addressing accessibility / TCO issues, but the risk is that we re-invent (lowest-common-denominator) parallel RDBMS technologies.
- Your goal should not be to try to recreate your IDW on Hadoop (you will likely fail), but to build a Data Lake to capture new data and support new processing...
...so start with a business goal, not with a technology.
- Web / clickstream: who navigates to the website, what do they do in each session, and then afterwards within other channels?
- Voice / text: who is complaining to the call center, and about what?
- E-mail / graph: which brokers are colluding to rig markets, and with whom?
- Sentiment: what are customers saying about the company / products / services on social media sites?
- Process / path analytics: what is the optimal process for claims or collections activity?