Integrating Apache Spark with an Enterprise Data Warehouse




Integrating Apache Spark with an Enterprise Data Warehouse. Dr. Michael Wurst, IBM Corporation, Architect Spark/R/Python Database Integration, In-Database Analytics. Dr. Toni Bollinger, IBM Corporation, Senior Software Developer, IBM BigSQL

Agenda: Enterprise Data Warehouses and Open Source Analytics; IBM BigSQL; the impact of modern Data Warehouse technology; accelerating Apache Spark through SQL push-down; lessons learned.

Enterprise Data Warehouses and Open Source Analytics. Data Warehouses provide critical features like security, performance, and backup/recovery that make them the technology of choice for most enterprise uses. However, modern workloads like advanced analytics or machine learning require a more flexible and agile environment, such as Apache Hadoop or Apache Spark. Combining both yields the best of both worlds: the enterprise strength and performance of modern Data Warehouses, and the flexibility and power of open source analytics.

BigSQL. Most existing applications in the enterprise use SQL, and SQL bridges the chasm between existing apps and big data. Big SQL provides access to all data stored in Hadoop via JDBC/ODBC: a Hadoop application issues rich standard SQL through a JDBC/ODBC driver to a JDBC/ODBC server backed by the Big SQL engine, which intelligently leverages database parallelism and query optimization. Supported data sources include Hive tables, HBase tables, and CSV files.

BigSQL Architecture. The Big SQL master node hosts the Big SQL scheduler, the database service, the Hive server, UDFs (run as fenced-mode processes), DDL processing, and the Hive metastore, alongside the cluster's management nodes. Each Big SQL worker node runs native and Java I/O engines and UDFs (again fenced), co-located with an HDFS data node and an MR task tracker; temp data, compute, and other services run on the same nodes.

BigSQL Advantages. Extends the realm of database / data warehousing technology to the Hadoop world, with broader functionality and better performance compared to native Hadoop SQL implementations. Existing skills (like SQL) can be reused, and existing applications (like reporting) can be reused as well. BigSQL allows the seamless exchange of data between Hadoop-based applications and Data Warehouse applications; this can be used to easily share data between Apache Spark and the BigSQL Warehouse.

The Impact of Modern Data Warehouse Technology. Enterprise Data Warehouse technology has evolved significantly in recent years. Some of the technologies involved include: columnar storage, improved compression, query execution on compressed data, in-memory storage, and SIMD optimization. The IBM DB2 BLU Acceleration engine, deployed in DB2 10 LUW and IBM dashDB, is an example of such technology.

BLU Acceleration, used in IBM DB2 and IBM dashDB. Dynamic In-Memory: in-memory columnar processing with dynamic movement of data from storage. Actionable Compression: a patented compression technique that preserves order, so data can be used without decompressing. Parallel Vector Processing: multi-core and SIMD (Single Instruction Multiple Data) parallelism. Data Skipping: skips unnecessary processing of irrelevant data.

BLU Columnar Storage. Minimal I/O: I/O is performed only on the columns and values that match the query, and as a query progresses through the pipeline the working set of pages is reduced. Work is performed directly on columns: predicates, joins, scans, etc. all operate on individual columns, and rows are not materialized until absolutely necessary to build the result set. Improved memory density: columnar data is kept compressed in memory. Extreme compression: more data values are packed into a very small amount of memory or disk. Cache efficiency: data is packed into cache-friendly structures.

BLU Uses Multiple Compression Techniques. Approximate Huffman encoding ("frequency-based compression"), prefix compression, and offset compression. In frequency-based compression, the most common values use the fewest bits. For example, for a column of US states: 2 high-frequency states get 1-bit codes (0 = California, 1 = New York); 8 medium-frequency states get 3-bit codes (000 = Arizona, 001 = Colorado, 010 = Kentucky, 011 = Illinois, ..., 111 = Washington); 40 low-frequency states get 6-bit codes (000000 = Alaska, 000001 = Rhode Island, ...). Exploiting skew in the data distribution improves the compression ratio, and the technique is very effective since all values in a column have the same data type. The encoding maps entire values to dictionary codes.

Data Remains Compressed During Evaluation. Encoded values do not need to be decompressed during evaluation: predicates (=, <, >, >=, <=, BETWEEN, etc.), joins, aggregations, and more work directly on encoded values. For example, SELECT COUNT(*) FROM T1 WHERE LAST_NAME = 'Johnson' encodes the literal 'Johnson' once and then compares the encoded values of the LAST_NAME column directly, without decompressing them.
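A minimal sketch of this trick, assuming a simple dictionary code (the codes and name list are illustrative): the predicate literal is encoded once, and the scan then compares small integer codes instead of strings.

```python
# Evaluating an equality predicate directly on encoded values:
# encode the literal once, then compare code-to-code, never decompressing.
dictionary = {"Brown": 0, "Gilligan": 1, "Johnson": 2, "Wong": 3}  # toy codes
encoded_column = [dictionary[v] for v in
                  ["Brown", "Johnson", "Johnson", "Johnson", "Johnson",
                   "Brown", "Johnson", "Gilligan", "Wong", "Johnson"]]

target = dictionary["Johnson"]  # SELECT COUNT(*) ... WHERE LAST_NAME = 'Johnson'
count = sum(1 for code in encoded_column if code == target)
```

Comparing fixed-width codes is also what makes the SIMD parallelism from the previous slide applicable.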

BLU Data Skipping. Automatic detection of large sections of data that do not qualify for a query and can be ignored, giving order-of-magnitude savings in all of I/O, RAM, and CPU. No DBA action is needed to define or use it; it is truly invisible. Minimum and maximum values for sections of data values are stored persistently.

BLU Synopsis Table. Metadata that describes which ranges of values exist in which parts of the user table. For a user table SALES_COL with columns S_DATE and QTY, the synopsis table SYN130330165216275152_SALES_COL might contain:

TSNMIN  TSNMAX  S_DATEMIN   S_DATEMAX   ...
0       1023    2005-03-01  2006-10-17  ...
1024    2047    2006-08-25  2007-09-15  ...

(TSN = Tuple Sequence Number.) This enables DB2 to skip portions of a table when scanning data to answer a query, and benefits from data clustering and from loading presorted data.
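The skipping logic the synopsis enables can be sketched directly from the table above. This is an illustrative model, not BLU's implementation; it keeps per-range min/max values and scans only the TSN ranges whose interval overlaps the query's predicate range.

```python
# Minimal sketch of synopsis-based data skipping over TSN ranges,
# using the S_DATE min/max values from the slide's example.
synopsis = [
    # (tsn_min, tsn_max, s_date_min, s_date_max)
    (0,    1023, "2005-03-01", "2006-10-17"),
    (1024, 2047, "2006-08-25", "2007-09-15"),
]

def ranges_to_scan(synopsis, lo, hi):
    """Return only the TSN ranges whose [min, max] overlaps the predicate [lo, hi]."""
    return [(a, b) for a, b, dmin, dmax in synopsis
            if not (dmax < lo or dmin > hi)]

# A query restricted to 2007 dates can skip the first block entirely.
hits = ranges_to_scan(synopsis, "2007-01-01", "2007-12-31")
```

ISO dates compare correctly as strings, which keeps the sketch dependency-free; a real engine compares encoded column values.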

How DB2 with BLU Acceleration Helps: a sub-second 10TB query. The system: 32 cores and a 10TB table with 100 columns holding 10 years of data. The query: SELECT COUNT(*) FROM MYTABLE WHERE YEAR = 2010. The result: a sub-second 10TB query, because each CPU core examines the equivalent of just a few MB of data. The 10TB of data shrink to 1TB after storage savings; accessing a single column touches only 10GB; data skipping reduces that to 1GB; spread over 32 cores, each core performs a roughly 32MB linear scan, which with encoding and SIMD runs as fast as an 8MB scan.
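The slide's arithmetic can be checked back-of-the-envelope. The factors below (10x storage savings, 1 of 100 columns accessed, 10x data skipping for 1 year of 10, 32 cores) come from the slide; decimal units are used for simplicity.

```python
# Reproducing the slide's data-reduction chain step by step.
TB = 10**12
raw = 10 * TB                              # 10TB table
after_compression = raw / 10               # ~1TB after storage savings
one_column = after_compression / 100       # ~10GB for the one accessed column
after_skipping = one_column / 10           # ~1GB after data skipping (1 year of 10)
per_core_mb = after_skipping / 32 / 10**6  # ~31MB linear scan per core
```

The result lands at about 31MB per core, matching the slide's "32MB linear scan on each core" within rounding.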

How to Easily Benefit from Modern Data Warehouse Technology from Apache Spark? Option 1: use the Data Warehouse as a pure data source and pull all or selected data into Spark RDDs. Spark benefits from fast data access, but none of the database's indexing structures is fully used, and the data is replicated in Spark, requiring additional memory. Option 2: push processing down from Spark into the underlying Data Warehouse, fully exploiting the features of the underlying database engine.
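The contrast between the two options can be sketched in plain Python, with the stdlib sqlite3 module standing in for the warehouse (the real setup would go through Spark and a JDBC source; all table and column names here are illustrative).

```python
# Option 1 vs. Option 2: pull all rows out and aggregate client-side,
# versus pushing the aggregation down so only a tiny result moves.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, qty INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EU", 10), ("EU", 20), ("US", 5)])

# Option 1: replicate every row into the client and aggregate there.
rows = con.execute("SELECT region, qty FROM sales").fetchall()
pulled = {}
for region, qty in rows:
    pulled[region] = pulled.get(region, 0) + qty

# Option 2: push the GROUP BY down; only the aggregated result is transferred.
pushed = dict(con.execute(
    "SELECT region, SUM(qty) FROM sales GROUP BY region"))
```

Both produce the same answer; the difference is how much data leaves the engine, which is exactly the memory-replication point made above.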

SQL Pushdown of (Complex) Operations from Spark. Assumption: the source data is already stored in the Data Warehouse, or is loaded into the Warehouse to speed up processing. Instead of creating Spark RDDs of the input data and applying (statistical) processing to these RDDs, we can do some of the necessary processing directly in-database. The Spark DataFrame already supports some operations that could directly benefit from database engine indexing structures, e.g. group by, selection, projection, and join. However, we can go a step further and speed up more complex statistical operations, such as regression models.

SQL Pushdown of (Complex) Operations from Spark to BLU. Idea: push the data-intensive parts of algorithms to the database engine and do the remaining processing in Spark. Showcase: linear regression with few predictors, which can be split into two steps: 1. calculating the covariance matrix; 2. calculating the actual linear model. The first step can be pushed to the data warehouse engine by issuing a simple select such as SELECT SUM(V1*V2), ... FROM INPUT. The second step is not data intensive and is easy to compute. Performance (based on the IBM dashDB enterprise plan): with 100,000,000 rows and 10 numeric predictors, calculating a linear model in-database takes under 3 seconds, which is less than even loading the data into a Spark RDD.
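The two-step split can be sketched end-to-end for the single-predictor case, again with sqlite3 standing in for the warehouse (table and column names are illustrative). Step 1 is one aggregation pass in the database; step 2 solves the tiny closed-form least-squares system client-side.

```python
# Step 1 (in-database): the data-intensive sums. Step 2 (client-side):
# the linear model from those sums -- only five numbers leave the engine.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE input (x REAL, y REAL)")
con.executemany("INSERT INTO input VALUES (?, ?)",
                [(1, 3), (2, 5), (3, 7), (4, 9)])  # data lies on y = 2x + 1

# Step 1: a single SELECT yields every aggregate the model needs.
n, sx, sy, sxx, sxy = con.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) FROM input"
).fetchone()

# Step 2: closed-form simple linear regression from the aggregates.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
```

With p predictors the same pattern returns the O(p^2) cross-product sums instead of five numbers, which is still tiny compared to the raw rows.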

Towards Data Warehouse Aware Implementations of Spark Functions. Idea: create implementations of data-intensive Spark algorithms that are aware of, and can use, the underlying Data Warehouse engine to push down part of the processing. Some algorithms that could profit heavily from this kind of processing: Naïve Bayes, k-means, tree models, and the calculation of statistics. All of these could benefit from the fast aggregation, selection, etc. offered by Data Warehouses, without replicating the data into Spark or creating additional index structures.
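For Naïve Bayes, for instance, the data-intensive part is just counting, which a GROUP BY handles directly. The sketch below again uses sqlite3 as a stand-in for the warehouse, with illustrative table, label, and feature names.

```python
# Warehouse-aware Naive Bayes sketch: push the counting to the engine,
# derive the (tiny) probability tables client-side.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (label TEXT, feature TEXT)")
con.executemany("INSERT INTO obs VALUES (?, ?)",
                [("spam", "buy"), ("spam", "buy"), ("spam", "hello"),
                 ("ham", "hello")])

# One GROUP BY per table replaces a full scan of the rows in Spark.
counts = {(l, f): c for l, f, c in con.execute(
    "SELECT label, feature, COUNT(*) FROM obs GROUP BY label, feature")}
totals = dict(con.execute("SELECT label, COUNT(*) FROM obs GROUP BY label"))

# Class-conditional probability P(feature | label) from the small aggregates.
p_buy_given_spam = counts[("spam", "buy")] / totals["spam"]
```

k-means and statistics calculations fit the same mold: per-group SUM/COUNT/AVG in the engine, the small model update outside.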

Lessons Learned. Data Warehouse technology has made significant advances in recent years, and combining the enterprise strength and performance of modern Data Warehouses with the flexibility and power of Spark is a perfect match. To make full use of the underlying Data Warehouse, Spark processing should be pushed into the Warehouse engine wherever possible. This also avoids additional memory consumption, as data does not need to be cached in Spark to achieve high performance. Using SQL push-down offers a simple and generic way to achieve this.