Trafodion: Operational SQL-on-Hadoop
SophiaConf 2015
Pierre Baudelle, HP EMEA TSC
July 6th, 2015
Hadoop workload profiles
A spectrum from operational (sub-second response times) to batch (hours):
- Operational: operational SQL = OLTP + interactions, real-time analytics
- Interactive: parameterized reports, drill-down visualization, exploration, data preparation
- Non-interactive: incremental batch processing, dashboards and scorecards
- Batch: operational batch processing, enterprise reports, data mining
Characteristics of operational DBMS applications
Generalized characteristics and requirements:
- Low-latency response times
- ACID transactions (guaranteed data consistency)
- Large numbers of users
- High concurrency
- High availability
- Scalable data volumes
- Multi-structured data
- Rapidly evolving data requirements (i.e., flexible schemas)
These requirements expose Hadoop limitations in transaction support, data integrity, real-time performance, operational query optimization, and workload management.
Trafodion: Introduction
Trafodion is a joint HP Labs and HP IT research project to develop operational SQL-on-Hadoop database capabilities.
- Complete: full-function SQL. Reuse existing SQL skills and improve developer productivity.
- Protected: distributed ACID transactions. Guarantees data consistency across multiple rows, tables, and SQL statements.
- Efficient: optimized for low-latency read and write transactions. Supports real-time transaction processing applications.
- Interoperable: standard ODBC/JDBC access. Works with existing tools and applications (see the connection sketch after this list).
- Open: Hadoop and Linux distribution neutral. Easy to add to your existing infrastructure, with no vendor lock-in.
Hadoop + operational SQL, leveraging 20+ years of operational database investment.
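Connecting through the standard JDBC interface looks like ordinary JDBC code. A minimal sketch follows, assuming the Trafodion Type 4 driver class name and URL format shown here; the host, port, and credentials are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class TrafodionConnect {
    public static void main(String[] args) throws Exception {
        // Load the Trafodion Type 4 JDBC driver (class name is an assumption).
        Class.forName("org.trafodion.jdbc.t4.T4Driver");

        // Placeholder host and port; the URL format follows the assumed T4 driver convention.
        String url = "jdbc:t4jdbc://myhost:23400/:";

        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            // Standard JDBC metadata call, just to confirm the connection works.
            System.out.println("Connected to: "
                + conn.getMetaData().getDatabaseProductName());
        }
    }
}
```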
Operational SQL-on-Hadoop use cases
- Integration of structured, semi-structured, and unstructured data
- Using Hadoop for all RDBMS needs: "Not Only Big Data", including operational transactional workloads
- Capture data directly into open, distributed HDFS structures (HBase and Hive)
- Example: a product catalog whose structured columns (item id, description, cost, price, department) are extended with semi-structured attributes per item type (for a TV: type, display size, resolution, brand, model, 3D; for a book: ISBN, author, publish date, format) and unstructured content (images, customer reviews)
- Illustrative query: SELECT all TVs WHERE price > 2000 AND type = Plasma AND display size > 50 AND customer sentiment is very positive (see the query sketch after this list)
- Data is accessible for reporting and analytics with no latency
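As a concrete illustration of the structured portion of that query, a parameterized lookup over a hypothetical PRODUCTS table might look like the sketch below. The table and column names are invented for illustration, and the "very positive sentiment" predicate is modeled as a numeric score column that would be produced by separate text analytics.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ProductQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: PRODUCTS(item_id, description, price, type, display_size, sentiment_score)
        String sql = "SELECT item_id, description, price "
                   + "FROM products "
                   + "WHERE price > ? AND type = ? AND display_size > ? AND sentiment_score > ?";
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:t4jdbc://myhost:23400/:", "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setBigDecimal(1, new java.math.BigDecimal("2000"));
            ps.setString(2, "Plasma");
            ps.setInt(3, 50);
            ps.setDouble(4, 0.9);   // "very positive" threshold, purely illustrative
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s %s %s%n",
                        rs.getString(1), rs.getString(2), rs.getBigDecimal(3));
                }
            }
        }
    }
}
```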
Trafodion innovation built upon the Hadoop stack
- Leverages Hadoop and HBase for core modules and maintains API compatibility
- Differentiation: ANSI SQL via ODBC/JDBC, relational schema abstraction, ACID transactions, low-latency reads and writes, parallel optimizations
- Trafodion layer: client services for ODBC and JDBC, SQL compiler/optimizer/executor, distributed transaction manager
- Standard Hadoop layer: ZooKeeper, HBase, Hive, HDFS
- Client applications connect using the ODBC or JDBC T4 driver
Leveraging HBase for scalability and availability
[Diagram: logical table view mapped onto HBase regions and region servers over HDFS; legend: RK = row key, CF = column family, CN = column name, TS = timestamp (one version), CV = cell value]
- Regions store contiguous ranges of table rows, clustered by row key
- Regions are dynamically split by HBase when they reach a configured size limit (i.e., autosharding)
- Salting provides even data distribution across HBase regions (see the DDL sketch after this list)
- Region servers are elastically scalable
- HDFS and HBase replication provide enhanced data availability and protection
- Allows fine-grained load balancing with dynamic movement of regions based on load
- Fast data recovery when a server or disk fails or is decommissioned
- Data in different column families is stored separately
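Salting is declared at table-creation time. The following is a minimal sketch, assuming Trafodion's SALT USING ... PARTITIONS clause and an invented table name; the DDL is simply submitted through a JDBC Statement.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSaltedTable {
    public static void main(String[] args) throws Exception {
        // Hypothetical events table, pre-distributed across 8 regions via salting.
        // The SALT USING ... PARTITIONS clause is assumed Trafodion DDL syntax.
        String ddl = "CREATE TABLE telemetry_events ("
                   + "  device_id INT NOT NULL,"
                   + "  event_ts  TIMESTAMP NOT NULL,"
                   + "  payload   VARCHAR(4000),"
                   + "  PRIMARY KEY (device_id, event_ts)"
                   + ") SALT USING 8 PARTITIONS";
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:t4jdbc://myhost:23400/:", "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(ddl);
        }
    }
}
```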
Distributed transaction protection
- Multiple row inserts, updates, and deletes to a table
- Multiple insert, update, and delete statements across tables
- Distributed insert, update, and delete transactions spanning multiple HBase regions (two-phase commit)
- Read-only transactions (eliminates commit overhead)
[Diagram: a single transaction spanning Tables A, B, and C across HBase Regions A through D; a JDBC sketch follows below]
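A multi-statement transaction is driven through standard JDBC transaction control. This is a minimal sketch with an invented accounts table; the important part is that both updates commit atomically or not at all.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class MultiStatementTransaction {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:t4jdbc://myhost:23400/:", "user", "password")) {
            conn.setAutoCommit(false);   // begin an explicit transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE acct_id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE acct_id = ?")) {
                debit.setBigDecimal(1, new java.math.BigDecimal("100.00"));
                debit.setLong(2, 1001L);
                debit.executeUpdate();

                credit.setBigDecimal(1, new java.math.BigDecimal("100.00"));
                credit.setLong(2, 2002L);
                credit.executeUpdate();

                conn.commit();           // both updates become visible together
            } catch (Exception e) {
                conn.rollback();         // any failure undoes all changes in the transaction
                throw e;
            }
        }
    }
}
```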
Optimized execution plans based on statistics
Optimizer features:
- Top-down, multi-pass optimization with branch-and-bound plan pruning considers more potential plans
- Utilizes equal-height histogram statistics (see the statistics sketch after this list)
- SQL pushdown considerations, e.g., predicate evaluation
- Eliminates sorts when feasible, both syntactically and semantically
- In-memory vs. overflow considerations
- Optimal degree of parallelism (DOP) considerations, including non-parallel plans
Benefits:
- Facilitates enhanced parallelism and SQL object handling efficiencies
- Optimizations for both operational transaction and reporting workloads
[Diagram: compiler pipeline (SQL analyzer, normalizer, plan generator) turning a SQL statement into an optimized plan, with table statistics feeding the cardinality and cost estimators]
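The histogram statistics the optimizer consumes are typically gathered explicitly. A minimal sketch follows, assuming Trafodion's UPDATE STATISTICS syntax and reusing the invented table name from the earlier example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class GatherStatistics {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:t4jdbc://myhost:23400/:", "user", "password");
             Statement stmt = conn.createStatement()) {
            // Build equal-height histograms on every column of the hypothetical table;
            // the UPDATE STATISTICS ... ON EVERY COLUMN form is assumed Trafodion syntax.
            stmt.execute("UPDATE STATISTICS FOR TABLE telemetry_events ON EVERY COLUMN");
        }
    }
}
```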
Trafodion performance objective
- YCSB operation speeds approach HBase (within 20%)
- Meets the current objective, with a maximum variance of 10.8%
[Chart: YCSB Singleton 50/50 (Workload A) throughput in operations per second vs. concurrency (streams, 0 to 1,024), comparing Trafodion 1.1 and HBase]
Trafodion performance objective
- YCSB and Order Entry workloads scale linearly through 20 nodes (240 cores)
- Meets the objective
[Charts: transactional Order Entry throughput, selects throughput, updates throughput, and YCSB 50/50 throughput]
Profiling Trafodion use cases

HBase expansion: expand workloads onto HBase and maximize return on the HBase/Hadoop investment
- Assumptions: Hadoop and HBase already exist, and there is interest in expanding them to process more operational workloads
- Value proposition: maximize return on the current HBase/Hadoop investment by processing more workloads
- Key technical requirements: a powerful SQL engine to query and update HBase; ACID distributed transaction protection; efficient federation of data between multiple sources in a single query; performance, scalability, and high availability to meet SLAs

RDBMS migration: consolidate and migrate operational workloads onto the Big Data platform
- Assumptions: no desire to be locked into proprietary RDBMS vendor technology (neutral toward or in favor of the open source model); need to reduce overall license cost; interested in, or already have, a Hadoop-centric Big Data strategy
- Value proposition: avoid vendor lock-in and reduce the overall cost of data technologies; leverage existing application investment while gaining elastic scalability and schema flexibility
- Key technical requirements: resolve scalability issues from the RDBMS, with strong elastic scaling requirements; interest in establishing a Big Data platform on the Hadoop ecosystem; leverage existing application investment through a SQL interface; clear definition of operational performance requirements; ACID distributed transaction protection with the performance and high availability needed to meet SLAs on operational workloads

Operational and analytics together: a scalable, flexible, and high-performing Big Data platform that combines operational and analytic workloads
- Assumptions: already committed to Hadoop, with plans to establish a Hadoop-centric Big Data ecosystem (e.g., a data lake)
- Value proposition: reduce the overall time-to-value for moving data from operations to analysis
- Key technical requirements: integrate structured, semi-structured, and unstructured data into an enterprise data lake hosting operational, historical, external (Big Data), and master data; gain insights across this enterprise data, not limited to just Big Data; eliminate massive data movement between operational and analytic platforms, dramatically reducing time-to-value; reduce movement and replication of data between databases; perform operational/transactional processing to ensure data quality, integrity, and transformation on Hadoop
Vehicle telemetry operational data store
Real-time capture, monitoring, and analysis at scale with high concurrency
Challenge:
- Thousands of devices transmitting today, expected to double within a couple of years
- Real-time onboarding of hundreds of millions of events every day
- Query the data upon arrival, along with historical data, in real time
- Scalability requirements: sub-second query response times; sustained performance as concurrent users increase beyond 100
Proposed solution:
- Trafodion on a standard x86 Linux cluster
- Data load, query, and extract in parallel (see the loading sketch after this list)
- Users can query both current and historical data
Benefits:
- Data-driven business management: central tracking and querying, alarm reconciliation, fleet health management
- Scalable, reliable, and real time
- Open source with no vendor lock-in
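To give a feel for the real-time onboarding path, here is a minimal JDBC batching sketch against the hypothetical telemetry_events table used earlier. The batch size and payload format are illustrative; the largest feeds would more likely go through a bulk-load path.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class EventLoader {
    public static void main(String[] args) throws Exception {
        String sql = "INSERT INTO telemetry_events (device_id, event_ts, payload) VALUES (?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:t4jdbc://myhost:23400/:", "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false);
            for (int i = 0; i < 1000; i++) {          // one incoming micro-batch of events
                ps.setInt(1, i % 50);                  // illustrative device id
                ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
                ps.setString(3, "{\"speed\": " + (60 + i % 40) + "}");
                ps.addBatch();
            }
            ps.executeBatch();                         // send the batch in one round trip
            conn.commit();                             // the events become visible atomically
        }
    }
}
```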
Re-hosting of telco operational DBMS systems
A single data management platform for rating, CRM, funds, and billing systems
Challenges (scale and mixed workload):
- Load: voice, SMS, and data files generated by hundreds of millions of users across the country; gigabytes of data to be loaded in minutes
- Rating: process raw arriving data and load it into Trafodion; rate freshly loaded data to track usage within minutes
- CRM: high transaction rates; comprehensive queries against historical data with high concurrency
- Funds: high transaction update rates with high concurrency
- Billing: complete end-of-month processing for hundreds of millions of user accounts within hours
Proposed solution and benefits:
- A single, integrated, and scalable data management platform spanning voice, SMS, and data loading, rating, funds, and billing
- Open source with no vendor lock-in
Trafodion summary
- Delivers a full-featured and optimized operational SQL-on-Hadoop DBMS solution with full transactional data protection
- Leverages existing SQL skills and expertise rather than MapReduce programming
- Provides a foundation for next-generation real-time transaction processing applications
- Guarantees transactional consistency across multiple SQL statements, tables, and rows
- Retains Hadoop benefits, i.e., reduced cost, scalability, elasticity, etc.
Transaction support + data integrity + real-time performance + operational query optimization + workload management
HP Trafodion V1.1.0 (open source since June 2014)
- Forrester, Mike Gualtieri (October 22nd, 2013): "The future of Hadoop is real time and transactional."
- Doug Cutting (October 30th, 2013): "We're in the middle of a revolution in data processing; it is inevitable that we will see just about every kind of workload be moved to this platform, even OnLine Transaction Processing (OLTP)."
- Project Trafodion 1.1 entered the Apache Incubator in May 2015
- For more information on Trafodion, contact Rohit Jain at rohit.jain@esgyn.com
SCS Cluster / Big Data Workgroup
- More than 25 academic and industrial partners in PACA
- Combines Big Data competencies from SCS members, aligned with the SCS Cluster's strategy
- Delivers documents on Big Data architecture
- Tests and validates Big Data approaches via proofs of concept
- Supports the emergence of collaborative projects (French, European) with the SCS Cluster
http://www.pole-scs.org/p%c3%b4le-scs/smart-specialisation-areas/gt-big-data
Thank You
Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.