Big Data Landscape for Databases




Big Data Landscape for Databases
Bob Baran, Senior Sales Engineer
rbaran@splicemachine.com
May 12, 2015

Typical Database Workloads

Operational workloads:

OLTP Applications
- Typical databases: MySQL, Oracle
- Use cases: ERP, CRM, supply chain
- Workload strengths: real-time updates; ACID transactions; high concurrency of small reads/writes; range queries

Real-Time Web, Mobile, and IoT Applications
- Typical databases: MongoDB, Cassandra, MySQL, Oracle
- Use cases: web, mobile, social, IoT
- Workload strengths: real-time updates; high ingest rates; high concurrency of small reads/writes; range queries

Real-Time, Operational Reporting
- Typical databases: MySQL, Oracle
- Use cases: operational datastores, Crystal Reports
- Workload strengths: real-time updates; canned, parameterized reports; range queries

Analytical workloads:

Ad-Hoc Analytics
- Typical databases: Greenplum, ParAccel, Netezza
- Use cases: exploratory analytics, data mining
- Workload strengths: complex queries requiring full table scans

Enterprise Data Warehouses
- Typical databases: Teradata, Oracle, Sybase IQ
- Use cases: enterprise reporting
- Workload strengths: append only; parameterized reports against historical data

Recent History of RDBMSs

RDBMS definition:
- Relational with joins
- ACID transactions
- Secondary indexes
- Typically row-oriented
- Operational and/or analytical workloads

By the early 2000s:
- Limited innovation
- It looked like Oracle and Teradata had won

Hadoop Shakes Up Batch Analytics
- Data processing framework
- Cheap distributed file system
- Brute-force batch processing through MapReduce
- Great for batch analytics
- Great place to dump data to look at later

NoSQL Shakes Up Operational DBs
- The NoSQL wave: companies like Google, Amazon, and LinkedIn needed greater scalability and schema flexibility
- New databases built by application developers, not database people
- Provided scale-out, but lost SQL
- Worked well at web startups because:
  - In some cases, the use cases did not need ACID
  - They were willing to handle exceptions at the app level

Convoluted Evolution of Databases
[Chart plotting functionality against scalability]
- Indexed files (ISAM), 1960s
- Hierarchical/network databases, 1970s
- Traditional RDBMSs, 1980s-2000s (scale up)
- Hadoop, 2005 (scale out)
- NoSQL databases, 2010 (scale out)
- Scale-out SQL databases, 2013 (scale out)

Mainstream User Changes
- Driven by web, social, mobile, and the Internet of Things
- Major increases in scale: 30% annual data growth
- Significant requirements for semi-structured data, though relatively little unstructured

Technology adoption continuum:
- "What is it?": scale-out SQL databases for operational apps
- "Should I use it?": NoSQL for web apps; Hadoop technologies for analytics
- "Why wouldn't I use it?": cloud

Schema on Ingest vs. Schema on Read
[Diagram: a data stream reaches the application through either schema on ingest or schema on read]
- Structured data should always remain structured
- Use schema on read if you only use the data a few times a year
- Add schema if the data is used regularly
- Even "schemaless" MongoDB requires schema: in "10 Things You Should Know About Running MongoDB At Scale," Asya Kamsky, Principal Solutions Architect at MongoDB, lists "have a good schema and indexing strategy" as item #1
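To make the trade-off concrete, here is a minimal Hive-style sketch (table names, columns, and paths are hypothetical): schema on read projects a schema onto raw files at query time, while schema on ingest validates and types the data as it is loaded.

```sql
-- Schema on read: project a schema onto raw files already in HDFS.
-- Nothing is parsed or validated until a query runs.
CREATE EXTERNAL TABLE clicks_raw (
    user_id STRING,
    url     STRING,
    ts      STRING          -- everything lands as text
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/clicks';

-- Schema on ingest: declare types up front and convert while loading,
-- so every later query works against clean, typed columns.
CREATE TABLE clicks (
    user_id BIGINT,
    url     STRING,
    ts      TIMESTAMP
);

INSERT INTO TABLE clicks
SELECT CAST(user_id AS BIGINT), url, CAST(ts AS TIMESTAMP)
FROM clicks_raw;
```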

Scale-Out Is the Future of Databases
[Diagram: "How do I scale?" branches into scale up (traditional RDBMS) and scale out (NoSQL, NewSQL, MPP, and SQL-on-Hadoop, spanning Hadoop RDBMSs and analytic engines)]

NoSQL

Pros:
- Easy scale-out
- Flexible schema
- Easier web development with hierarchical data structures (MongoDB)
- Cross-data center replication (Cassandra)

Cons:
- No SQL: requires retraining and app rewrites
- No joins, i.e., no cross-row/document dependencies
- No reliable updates through transactions across rows/tables
- Eventual consistency (Cassandra)
- Not designed to do the aggregations required for analytics
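As a hedged illustration of the last three cons (hypothetical tables), the single statement below leans on exactly what most NoSQL stores give up: a cross-table join, a transactionally consistent read, and a server-side aggregation.

```sql
-- One declarative statement: join + aggregate across tables.
-- In a document store this typically becomes application code that
-- fetches documents and aggregates client-side, or a denormalized copy.
SELECT c.region, SUM(o.amount) AS total_sales
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.region;
```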

NewSQL

Pros:
- Easy scale-out
- ANSI SQL eliminates retraining and app rewrites
- Reliable updates through ACID transactions
- RDBMS functionality
- Strong cross-data center replication (NuoDB)

Cons:
- Proprietary scale-out, unproven into petabytes
- Must manage another distributed infrastructure beyond Hadoop
- Cannot leverage the Hadoop ecosystem of tools

NewSQL In-Memory

Pros:
- Easy scale-out
- High performance because everything is in memory
- ACID transactions within nodes

Cons:
- Memory is 10-20x more expensive
- Limited SQL
- Limited cross-node transactions
- Proprietary scale-out, unproven into petabytes
- Must manage another distributed infrastructure beyond Hadoop
- Cannot leverage the Hadoop ecosystem

Operational RDBMS on Hadoop

Pros:
- Easy scale-out
- Scale-out infrastructure proven into petabytes
- ANSI SQL eliminates retraining and app rewrites
- Reliable updates through ACID transactions
- Leverages the Hadoop distributed infrastructure and tool ecosystem

Cons:
- Full table scans slower than MPP databases, though faster than traditional RDBMSs
- Existing HDFS data must be reloaded through the SQL interface
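The second con deserves a concrete sketch: files already sitting in HDFS are not queried in place but loaded through SQL. Splice Machine's Derby lineage exposes system procedures for this; the call below follows the documented SYSCS_UTIL.IMPORT_DATA pattern, but the exact signature varies by version, and the schema, table, and paths here are hypothetical.

```sql
-- Target table for data currently sitting in HDFS as CSV.
CREATE TABLE sales.orders (
    order_id   BIGINT PRIMARY KEY,
    order_date DATE,
    amount     DECIMAL(10,2)
);

-- Bulk-load the HDFS file through the SQL interface; check your
-- version's documentation for the full parameter list.
CALL SYSCS_UTIL.IMPORT_DATA(
    'SALES', 'ORDERS',
    null,                       -- load all columns
    '/data/raw/orders.csv',     -- HDFS file or directory
    ',', '"',                   -- column and character delimiters
    null, 'yyyy-MM-dd', null,   -- timestamp, date, time formats
    0,                          -- bad records tolerated before failing
    '/data/bad',                -- where rejected records are written
    null, null);                -- one-line records, charset
```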

MPP Analytical Databases

Pros:
- Easy scale-out
- Very fast performance for full table scans
- Highly parallelized, shared-nothing architectures
- May have columnar storage (Vertica)
- No maintenance of indexes (Netezza)

Cons:
- Poor concurrency models prevent support of real-time apps
- Poor performance for range queries
- Need to redistribute all data to add nodes (hash partitioning)
- May require specialized hardware (Netezza)
- Proprietary scale-out: cannot leverage the Hadoop ecosystem of tools

SQL-on-Hadoop Analytical Engines

Pros:
- Easy scale-out
- Scale-out proven into petabytes
- Leverages the Hadoop distributed infrastructure
- Can leverage the Hadoop ecosystem of tools

Cons:
- Relatively immature, especially compared to MPP databases
- Limited SQL
- Poor concurrency models prevent support of real-time apps
- No reliable updates through transactions
- Intermediate results must fit in memory (Presto)

Future: Hybrid In-Memory Architectures
- Memory cache with disk: unsophisticated memory management
- Pure in-memory: very expensive
- Hybrid in-memory: flexible and cost-effective; controlled by the optimizer; possibly in-memory materialized views

Summary: Future of Databases

Predicted trends:
- Scale-out dominates databases
- Developers stop worrying about data size and develop new data-driven apps
- Hybrid in-memory architecture becomes mainstream

Predicted winners:
- Hadoop becomes the de facto distributed file system
- NoSQL used for simple web apps
- Scale-out SQL RDBMSs replace traditional RDBMSs

Questions?
Bob Baran, Senior Sales Engineer
rbaran@splicemachine.com
May 12, 2015

Powering Real-Time Apps on Hadoop
Bob Baran, Senior Sales Engineer
rbaran@splicemachine.com
May 12, 2015

Who Are We?
The only Hadoop RDBMS: power operational applications on Hadoop.
- Affordable, scale-out: commodity hardware; elastic, easy to expand or scale back; 10x better price/performance
- Transactional: real-time updates and ACID transactions
- ANSI SQL: leverage existing SQL code, tools, and skills
- Flexible: supports operational and analytical workloads

What People Are Saying
Recognized as a key innovator in databases.

Quotes:
- "Scaling out on Splice Machine presented some major benefits over Oracle... automatic balancing between clusters... avoiding the costly licensing issues."
- "An alternative to today's RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop."
- "The unique claim of Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop."

[Award logos]

Advisory Board
The advisory board includes luminaries in databases and technology:
- Mike Franklin: Computer Science Chair, UC Berkeley; Director, UC Berkeley AMPLab; founder of Apache Spark
- Roger Bamford: former Principal Architect at Oracle; father of Oracle RAC
- Marie-Anne Neimat: co-founder, TimesTen Database; former VP, Database Engineering at Oracle
- Ken Rudin: Head of Analytics at Facebook; former GM of Oracle Data Warehousing

Combines the Best of Both Worlds

Hadoop:
- Scale-out on commodity servers
- Proven to 100s of petabytes
- Efficiently handles sparse data
- Extensive ecosystem

RDBMS:
- ANSI SQL
- Real-time, concurrent updates
- ACID transactions
- ODBC/JDBC support

Focused on OLTP and Real-Time Workloads

Operational workloads (Splice Machine's focus):

OLTP Applications
- Typical databases: MySQL, Oracle
- Use cases: ERP, CRM, supply chain
- Workload strengths: real-time updates; ACID transactions; high concurrency of small reads/writes; range queries

Real-Time Web, Mobile, and IoT Applications
- Typical databases: MySQL, Oracle, MongoDB, Cassandra
- Use cases: web, mobile, social, IoT
- Workload strengths: real-time updates; high ingest rates; high concurrency of small reads/writes; range queries

Real-Time, Operational Reporting
- Typical databases: MySQL, Oracle
- Use cases: operational datastores, Crystal Reports
- Workload strengths: real-time updates; canned, parameterized reports; range queries

Analytical workloads:

Ad-Hoc Analytics
- Typical databases: Greenplum, ParAccel, Netezza
- Use cases: exploratory analytics, data mining
- Workload strengths: complex queries requiring full table scans

Enterprise Data Warehouses
- Typical databases: Teradata, Oracle, Sybase IQ
- Use cases: enterprise reporting
- Workload strengths: append only; parameterized reports against historical data

OLTP Campaign Management: Harte-Hanks

Overview:
- Digital marketing services provider
- Unified Customer Profile
- Real-time campaign management
- OLTP environment with BI reports

Challenges:
- Oracle RAC too expensive to scale
- Queries too slow, taking up to half an hour
- Getting worse: expecting 30-50% data growth
- Looked for nine months for a cost-effective solution

[Solution diagram: cross-channel campaigns, real-time personalization, real-time actions]

Initial results:
- 10-20x price/performance with no application, BI, or ETL rewrites
- ¼ the cost with commodity scale-out
- 3-7x faster through parallelized queries

Reference Architecture: Operational Data Lake
Offload real-time reporting and analytics from expensive OLTP and DW systems.
[Diagram: OLTP systems (ERP, CRM, supply chain, HR) send stream or batch updates to the operational data lake, which powers operational reports and analytics and real-time, event-driven apps; ETL feeds the data warehouse for executive business reports, ad-hoc analytics, and datamarts]

Streamlining the Structured Data Pipeline in Hadoop

Traditional Hadoop pipeline: source systems (ERP, CRM) -> Sqoop -> stored as flat files -> apply inferred schema -> SQL query engines -> BI tools

Streamlined Hadoop pipeline: source systems (ERP, CRM) -> existing ETL tool -> stored in the same schema -> BI tools

Advantages:
- Reduced operational costs with less complexity
- Reduced processing time and errors with fewer translations
- Real-time updates for data cleansing
- Better SQL support

Complementing Existing Hadoop-Based Data Lakes
Optimizing storage and querying of structured data as part of ELT or Hadoop query engines.
[Diagram: OLTP systems (ERP, CRM, supply chain, HR) and unstructured data flow into a data lake fronted by HCatalog]
1. Schema on ingest: streamlined, structured-to-structured integration
2. Schema before read: repository for structured data, or metadata from the ELT process on unstructured data
3. Schema on read: ad-hoc Hadoop queries (e.g., Pig) across structured and unstructured data

Proven Building Blocks: Hadoop and Derby

Apache Derby:
- ANSI SQL-99 RDBMS
- Java-based
- ODBC/JDBC compliant

Apache HBase/HDFS:
- Auto-sharding
- Real-time updates
- Fault tolerance
- Scalability to 100s of PBs
- Data replication

HBase: Proven Scale-Out
- Auto-sharding
- Scales with commodity hardware
- Cost-effective from GBs to PBs
- High availability through failover and replication
- LSM-trees

Splice Optimizations to HBase

Optimized storage:
- Splice storage is optimized over raw HBase: bitmap indexes store data in packed byte arrays
- This yields a much smaller footprint than traditional HBase; with a TPC-H schema, we found a 10x reduction in data size
- Requires far less hardware and fewer resources to perform the same workload

Asynchronous write pipeline:
- HBase writes (puts) are not pipelined and block while the call is being made
- Splice's write pipeline reaches speeds of over 100K writes per second per HBase node
- This allows extremely high ingest speeds without requiring more hardware or custom code

Transactions:
- As scalability increases, the likelihood of failures increases
- Splice uses snapshot isolation to ensure that a failure does not corrupt existing data

RDBMS capabilities:
- SQL instead of custom scans, with an optimizer that chooses the best access path to the data
- Core data management functions (indexes, constraints, typed columns, etc.)
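As a small illustration of those RDBMS capabilities (table and index names are hypothetical), the lookup below is declarative SQL with typed columns and a constraint; against raw HBase, the same access would be a hand-written scan against a carefully designed row key.

```sql
-- Typed columns and a constraint, enforced by the database.
CREATE TABLE orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    order_date  DATE   NOT NULL,
    amount      DECIMAL(10,2) CHECK (amount >= 0)
);

-- A secondary index the optimizer can choose instead of a full scan.
CREATE INDEX idx_orders_customer ON orders (customer_id, order_date);

-- The optimizer picks the access path (here, likely the index).
SELECT order_id, amount
FROM orders
WHERE customer_id = 42
  AND order_date >= DATE('2015-01-01');
```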

Distributed, Parallelized Query Execution
- Parallelized computation across the cluster
- Moves computation to the data
- Utilizes HBase co-processors, which run in the HBase server memory space
- No MapReduce

ANSI SQL-99 Coverage
- Data types, e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
- Conditional functions, e.g., CASE, searched CASE
- DDL, e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE
- DML, e.g., INSERT, DELETE, UPDATE, SELECT
- Privileges, e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
- Predicates, e.g., IN, BETWEEN, LIKE, EXISTS
- Cursors, e.g., updatable, read-only, positioned DELETE/UPDATE
- Joins, e.g., INNER JOIN, LEFT OUTER JOIN
- Query specification, e.g., SELECT DISTINCT, GROUP BY, HAVING
- Set functions, e.g., UNION, ABS, MOD, ALL, CHECK
- Transactions, e.g., COMMIT, ROLLBACK, READ COMMITTED, REPEATABLE READ, READ UNCOMMITTED, snapshot isolation
- Sub-queries
- Aggregation functions, e.g., AVG, MAX, COUNT
- String functions, e.g., SUBSTRING, concatenation, UPPER, LOWER, POSITION, TRIM, LENGTH
- Triggers
- User-defined functions (UDFs)
- Views, including grouped views
- Window functions, e.g., RANK, ROW_NUMBER
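A short sketch that exercises several of these buckets at once, using a hypothetical schema: DDL, an inner join, predicates, grouping, aggregates, and HAVING.

```sql
CREATE SCHEMA demo;

CREATE TABLE demo.customers (
    customer_id BIGINT PRIMARY KEY,
    region      VARCHAR(50)
);

CREATE TABLE demo.orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT,
    amount      DECIMAL(10,2),
    order_date  DATE
);

-- Join + predicates + grouping + aggregates in one query.
SELECT c.region,
       COUNT(*)      AS order_count,
       AVG(o.amount) AS avg_amount
FROM demo.customers c
INNER JOIN demo.orders o ON o.customer_id = c.customer_id
WHERE o.order_date BETWEEN DATE('2015-01-01') AND DATE('2015-03-31')
GROUP BY c.region
HAVING COUNT(*) > 10;
```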

Window Functions (Advanced Analytics Functions)
- Power analytics such as running totals, moving averages, and top-N queries
- Perform calculations across a set of table rows related to the current row in the window
- Similar to aggregate functions, with two significant differences:
  - They output one row for each input row they operate on
  - They group rows with window partitioning and frame clauses rather than GROUP BY
- Splice Machine currently supports RANK, DENSE_RANK, ROW_NUMBER, AVG, SUM, COUNT, MAX, and MIN
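For example, a running total per customer and a rank of each order's amount within its customer, using two of the supported functions (table and column names are hypothetical):

```sql
SELECT customer_id,
       order_date,
       amount,
       -- Running total: a frame clause over the partition, not GROUP BY.
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total,
       -- One output row per input row, unlike an aggregate query.
       RANK() OVER (
           PARTITION BY customer_id
           ORDER BY amount DESC
       ) AS amount_rank
FROM orders;
```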

Lockless, ACID Transactions
- Adds multi-row, multi-table transactions with rollback to HBase
- Fast, lockless, high concurrency
- Extends research from Google Percolator, Yahoo Labs, and the University of Waterloo
- Patent-pending technology
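In SQL terms, here is a minimal sketch of what multi-row, multi-table atomicity buys (hypothetical tables; assumes autocommit is off, with the statements running under snapshot isolation):

```sql
-- Move funds and log the transfer: all three statements commit
-- together or, on any failure, roll back together.
UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100.00 WHERE account_id = 2;
INSERT INTO transfer_log (from_id, to_id, amount) VALUES (1, 2, 100.00);
COMMIT;
-- On error: ROLLBACK; leaves every row untouched.
```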

Customer Performance Benchmarks
Typically a 10x price/performance improvement.
[Chart: customer benchmarks showing speeds 3-7x, 20x, and 7x faster, with price/performance 10-20x, 10x, and 30x lower, versus incumbent databases]

Applications and BI/SQL tool support via ODBC/JDBC

Splice Machine Safe Journey Process

1. Initial Overview (1 day): Splice Machine overview; sets the stage for the Rapid Assessment
2. Rapid Assessment (5 days, including prep): half-day workshop to assess Splice Machine fit, identify target use cases, assess the risk of those use cases, and agree upon success criteria
3. Proof of Concept (2 weeks): prove the client use case in a Splice Machine-hosted environment; benchmark using customer queries and schema, on customer data or generated data that resembles it
4. Pilot Project (3-6 weeks): identify a paid pilot use case with limited change-management impact; install Splice Machine in the client environment; deploy the use case/application on client data; prove Splice Machine against key requirements
5. Enterprise Implementation (3-10 months): Kickstart, Requirements, Design/Dev, QA Test, Cutover, Hypercare

Safe Journey Enterprise Implementation Stages

Kickstart
- Packaged 2-week program to get a new client off to a strong start on a solid foundation
- Incorporates: Splice architecture and development courses, a risk assessment workshop, and an implementation blueprint

Requirements
- Establish a clear functional and performance requirements document
- Can be a refresh only if the project is a port of an existing app to Splice

Design/Dev
- Based on the Agile method; the phase is divided into 2-week sprints
- Stories covering a set of required capabilities are assigned to each developer
- A design doc is created, code is written, and unit tests are written and executed until they pass

QA Test
- Includes: performance test, end-to-end system integration test, user acceptance test
- Depending on the scale of the project, there may be multiple iterations of each test with break/fix cycles in between

Parallel Ops (optional)
- Used when an existing system is being ported to Splice Machine from another database
- The new Splice Machine-based system runs side by side with the old system for a period of time

Cutover
- Formal period in which the Splice-based solution goes live and the pre-existing system is deprecated

Hypercare (optional)
- Period of onsite support during cutover and for a period immediately following go-live

Common Risks and Mitigation Strategies

Data migration
- Risk: Clients are typically migrating very large data sets to Splice Machine. Issues with migration of certain data types, such as dates, can waste a lot of time reloading large amounts of data.
- Solution: First migrate a small subset of tables that contain all required data types. Ensure these migrate successfully before migrating the entire database.

Changes to source schema during implementation
- Risk: Changes to the schema of the source database during the course of the implementation will lead to a significant amount of rework and reloading of data, adding unplanned time to the project.
- Solution: All stakeholders agree up front to freeze the schema as of an agreed-upon date prior to the Design/Development stage.

Stored procedure conversion
- Risk: Stored procedures need to be converted from the original language (e.g., PL/SQL) to Java. Complex stored procedures may include significant amounts of procedural code as well as multiple SQL statements.
- Solution: Carefully review the function and design of the stored procedures to be converted.

Common Risks and Mitigation Strategies

SQL compatibility
- Risk: Even though Splice Machine conforms to the ANSI SQL-99+ standard, virtually every database has unique syntax, and some queries may need to be modified. Additionally, SQL generated by packaged applications may not be modifiable.
- Solution: Formal review of SQL syntax during the Requirements phase. Modify relevant queries during the Design/Dev phase. If a query is not modifiable, an enhancement request for Splice Machine to support the required syntax out of the box may be needed.

Indexing
- Risk: Proper indexing is usually important to maximize the performance of Splice Machine. Splice Machine indexes are likely to differ from the indexes required for a traditional RDBMS.
- Solution: Ensure that query performance SLAs are clearly defined in the Requirements phase. Incorporate proper index design early in the Design/Dev phase. Assume some iteration will be required to arrive at the optimal indexes.

Hadoop knowledge
- Risk: Project stakeholders often have limited knowledge of Hadoop and the distributed computing paradigm. This can lead to confusion about the Splice Machine value proposition and the advantages of moving to a scale-out architecture.
- Solution: Include the Splice Machine Kickoff Program at the beginning of the implementation project. This includes essential training on Hadoop and related fundamental concepts critical to realizing value from a Splice Machine deployment.

Summary
The only Hadoop RDBMS: power operational applications on Hadoop.
- Affordable, scale-out: commodity hardware; elastic, easy to expand or scale back; 10x better price/performance
- Transactional: real-time updates and ACID transactions
- ANSI SQL: leverage existing SQL code, tools, and skills
- Flexible: supports operational and analytical workloads

Questions?
Bob Baran, Senior Sales Engineer
rbaran@splicemachine.com
May 12, 2015