Oracle Big Data SQL: Architectural Deep Dive
Dan McClary, Ph.D., Big Data Product Management, Oracle
Copyright 2014, Oracle and/or its affiliates. All rights reserved.
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Agenda
1. The Data Analytics Challenge
2. Why Unified Query Matters
3. SQL on Hadoop and More: Unifying Metadata
4. Query Franchising: Smart Scan for Hadoop
Data Analytics Challenge (2012)
Separate data access interfaces
Tables on NoSQL; SQL on Hadoop
Data Analytics Challenge (2013)
No comprehensive SQL interface across Oracle, Hadoop, and NoSQL
Oracle Big Data Management System
Unified SQL access to all enterprise data: Oracle Database, Hadoop, and NoSQL
How to Get the Most From Polyglot Persistence
- Rapid development
- Simple semantics
- Active archive
- Data exploration
- Scale-out analytics
- Business-critical processes
- Sensitive data
- Existing applications
SQL on Hadoop and More: Unifying Metadata
Why Unify Metadata?

CREATE TABLE customers ...
CREATE TABLE sales ...
SELECT name FROM customers;
SELECT customers.name, sales.amount FROM ...;

- Catalog tables and query across sources (CUSTOMERS, SALES)
- Integrate new metadata with no changes for users and applications
- Seamlessly handle schema-on-read
- Exploit remote data distribution
- Holistically optimize queries

A cross-source query sketch follows this list.
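As a sketch of what a unified catalog enables, assuming CUSTOMERS is an ordinary database table and SALES is a table over a remote source (both names are from the slide; the column names are illustrative), a single statement can join across stores:

-- Hypothetical sketch: one optimizer plans across both sources
SELECT c.name, SUM(s.amount) AS total_spend
FROM   customers c, sales s
WHERE  c.custid = s.custid
GROUP BY c.name;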
How Data is Stored in HDFS: Example, a 1 TB File

Block B1:
{"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custid":1083711,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}

Block B2:
{"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custid":1010220,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custid":1253676,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custid":1351777,"movieid":608,"genreid":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}

Block B3:
{"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
{"custid":1363545,"movieid":27205,"genreid":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}
{"custid":1067283,"movieid":1124,"genreid":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}
{"custid":1126174,"movieid":16309,"genreid":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}
{"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}
...
{"custid":1346299,"movieid":424,"genreid":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

1 block = 256 MB; example file = 4,096 blocks; InputSplits = 4,096 → potential scan parallelism (worked out below).
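The arithmetic behind the slide: 1 TB = 1,048,576 MB, and 1,048,576 MB / 256 MB per block = 4,096 blocks. With the usual one InputSplit per block, a full scan of this file can fan out across up to 4,096 parallel readers.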
How MapReduce and Hive Read Data

Scan and row creation must be able to work on any data format: data definitions and column deserializations are needed to present a table to the consumer (scan on the Data Node disk → create rows and columns → consumer).

- RecordReader: scans data (keys and values)
- InputFormat: defines parallelism
- SerDe: makes columns
- Metastore: maps DDL to Java access classes

A Hive DDL sketch tying these pieces together follows.
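A minimal Hive DDL sketch for the JSON movie log shown earlier; the table name matches the metastore slide that follows, while the HDFS path is hypothetical, and the JsonSerDe class ships with Hive's hcatalog module (older clusters may use a different SerDe):

-- SerDe makes the columns; TEXTFILE's InputFormat defines the splits
CREATE EXTERNAL TABLE movieapp_log_json (
  custid      INT,
  movieid     INT,
  genreid     INT,
  `time`      STRING,
  recommended STRING,
  activity    INT)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/oracle/moviework/applog_json';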
SQL-on-Hadoop Engines Share Metadata, not MapReduce

Oracle Big Data SQL, SparkSQL, Hive, and Impala all read table definitions (e.g., movieapp_log_json, tweets, avro_log) from the Hive Metastore. The Metastore maps DDL to Java access classes.
Extend Oracle External Tables

CREATE TABLE movielog (
  click VARCHAR2(4000))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename logs
    com.oracle.bigdata.cluster mycluster
  ))
REJECT LIMIT UNLIMITED;

New types of external tables:
- ORACLE_HIVE (inherits metadata)
- ORACLE_HDFS (specifies metadata; see the sketch below)

Access parameters identify the Hadoop cluster and the remote Hive database/table. The DBMS_HADOOP package provides automatic import.
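For comparison, a hedged ORACLE_HDFS sketch in which the DDL supplies the metadata instead of inheriting it from Hive; the path is hypothetical and the defaults (delimited text into the declared columns) follow Oracle's documented examples:

CREATE TABLE movielog_hdfs (
  click VARCHAR2(4000))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HDFS
  DEFAULT DIRECTORY DEFAULT_DIR
  LOCATION ('/user/oracle/moviework/applog_json/*'))
REJECT LIMIT UNLIMITED;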
A Smarter External Table

You define:
- Table name
- Oracle types
- Any degree of parallelism

You get:
- Automatic discovery of Hive table metadata
- Automatic translation from Hadoop types
- Automatic conversion from any InputFormat
- Fan-out parallelism across the Hadoop cluster

A sketch of the DBMS_HADOOP import path follows.
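The automatic-discovery path is the DBMS_HADOOP package mentioned above; a sketch of generating (without executing) the external-table DDL for a Hive table, with the caveat that the parameter names here are recalled from the Big Data SQL documentation and may differ by release:

DECLARE
  ddl_text CLOB;
BEGIN
  DBMS_HADOOP.CREATE_EXTDDL_FOR_HIVE(
    cluster_id      => 'mycluster',
    db_name         => 'default',
    hive_table_name => 'movieapp_log_json',
    hive_partition  => FALSE,
    table_name      => 'MOVIEAPP_LOG_JSON',
    perform_ddl     => FALSE,   -- just generate the DDL text
    text_of_ddl     => ddl_text);
  DBMS_OUTPUT.PUT_LINE(ddl_text);
END;
/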
StorageHandlers: Extensibility Beyond HDFS
Hive Metastore StorageHandlers are a metadata bridge between non-HDFS stores and Oracle Big Data SQL.
Query Franchising: Smart Scan for Hadoop
Language-level Federation Fails
Been there, done that.

-- Hadoop part:
with sites as (
  select s.custid as cust_id,
         listagg(s.site, ',') within group (order by s.custid) as site_list
  from shortcodes s
  group by s.custid
)
-- Database part:
select c.first_name, c.last_name, c.age, c.state_province, s.site_list
from customers c, sites s
where c.custid = s.cust_id;
Language-level Federation Fails (continued)
Consider the Hadoop-side aggregation alone:

select s.custid as cust_id,
       listagg(s.site, ',') within group (order by s.custid) as site_list
from shortcodes s
group by s.custid;

- Do the operators exist in both places?
- Is their behavior consistent?
- How do you negotiate resources?

The unified-metadata alternative is sketched below.
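With unified metadata, shortcodes is exposed once as an external table, after which the federation query above runs unchanged as ordinary Oracle SQL: one optimizer plans the whole statement and the scan is offloaded to Hadoop. A sketch (the column types are assumptions):

CREATE TABLE shortcodes (
  custid NUMBER,
  site   VARCHAR2(4000))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename shortcodes
    com.oracle.bigdata.cluster mycluster
  ))
REJECT LIMIT UNLIMITED;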
Query Franchising
Dispatch of query processing to self-similar compute agents on disparate systems without loss of operational fidelity.
Query Franchising: Uniform Behavior, Disparate Location
1. Top-level plan created
   - Holistic plan for all work
   - Distributed to franchises by location
2. Franchisees carry out local work
   - Franchises secure and utilize resources
   - All franchises speak the internal language
3. Global operations optimized
   - Adapts to local variation
   - Nothing lost in translation
What Can Big Data Learn from Exadata? Intelligent Storage Maximizes Performance

SELECT name, SUM(purchase)
FROM customers
GROUP BY name;

1. Oracle SQL query issued: plan constructed, query executed
2. Smart Scan works on storage (Oracle Exadata Storage Server, CUSTOMERS table):
   - Filter out unneeded rows
   - Project only queried columns
   - Score data models
   - Bloom filters to speed up joins
Big Data SQL: A New Hadoop Processing Engine

Processing layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
Resource management: YARN, cgroups
Storage layer: filesystem (HDFS), NoSQL databases (Oracle NoSQL DB, HBase)
Smart Scan for Hadoop: Optimizing Performance

Oracle on top (Big Data SQL Server: Smart Scan, External Table Services):
- Apply filter predicates
- Project columns
- Parse semi-structured data

Hadoop on the bottom (on the Data Node):
- Work close to the data
- Schema-on-read with Hadoop classes
- Transformation into an Oracle data stream

A query sketch follows.
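A sketch of a query whose filters and JSON parsing are candidates for Smart Scan evaluation close to the data, using the movielog external table defined earlier; JSON_VALUE is the standard Oracle 12c JSON function, and the field names follow the sample log:

SELECT COUNT(*)
FROM   movielog m
WHERE  JSON_VALUE(m.click, '$.genreid') = '44'
AND    JSON_VALUE(m.click, '$.recommended') = 'Y';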
Mapping Hadoop to Oracle Parallel Query
(Diagram: HDFS NameNode and Hive Metastore feed blocks B1-B3 as InputSplits to PX servers.)

1. Determine Hadoop parallelism
   - Determine schema-for-read
   - Determine InputSplits
   - Arrange splits for best performance
2. Map to Oracle parallelism
   - Map splits to granules
   - Assign granules to PX servers
3. Route work
   - Offload work to Big Data SQL Servers
   - Aggregate, join, and apply PL/SQL in the database

A parallel-hint example follows.
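A hedged example of driving the database side of this mapping with a standard parallel hint on the external table; granule-to-PX-server assignment is automatic, and the degree of 16 is arbitrary:

SELECT /*+ PARALLEL(m, 16) */ COUNT(*)
FROM movielog m;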
Big Data SQL 1.1 Introduces Enhanced Parallelism

Big Data SQL 1.0:
- Hadoop DoP linked to RDBMS DoP
- Led to many idle PQ processes
- Required explicit declaration

Big Data SQL 1.1:
- Unlinks Hadoop and RDBMS DoP
- Automatic maximum Hadoop parallelism, even on serial tables
- An average of 40% faster, even at equivalent DoP
Big Data SQL Dataflow
(Big Data SQL Agent → Smart Scan → External Table Services → SerDe/RecordReader → Data Node disks)

1. Read data from the HDFS Data Node
   - Direct-path reads
   - C-based readers when possible; native Hadoop classes otherwise
2. Translate bytes to Oracle format
3. Apply Smart Scan to the Oracle bytes
   - Apply filters
   - Project columns
   - Parse JSON/XML
   - Score models
But How Does Security Work?

SELECT * FROM my_bigdata_table
WHERE SALES_REP_ID = SYS_CONTEXT('USERENV','SESSION_USER');

DBMS_REDACT.ADD_POLICY(
  object_schema       => 'MCLICK',
  object_name         => 'TWEET_V',
  column_name         => 'USERNAME',
  policy_name         => 'tweet_redaction',
  function_type       => DBMS_REDACT.PARTIAL,
  function_parameters => 'VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',
  expression          => '1=1'
);

- Database security for query access: Virtual Private Database, Redaction, Audit Vault and Database Firewall (e.g., filtering on SESSION_USER as above; a VPD sketch follows)
- Hadoop security for Hadoop jobs: Kerberos authentication, Apache Sentry (RBAC), Audit Vault
- System-specific encryption: database tablespace encryption, BDA on-disk encryption
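As one concrete database-side control, a minimal VPD sketch using the standard DBMS_RLS package; the schema, function, and policy names are hypothetical, and the predicate mirrors the SESSION_USER filter above:

-- Policy function: VPD appends the returned predicate to matching queries
CREATE OR REPLACE FUNCTION rep_filter (
  p_schema IN VARCHAR2,
  p_object IN VARCHAR2) RETURN VARCHAR2 IS
BEGIN
  RETURN 'SALES_REP_ID = SYS_CONTEXT(''USERENV'',''SESSION_USER'')';
END;
/

BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'MCLICK',
    object_name     => 'MY_BIGDATA_TABLE',
    policy_name     => 'rep_rows_only',
    function_schema => 'MCLICK',
    policy_function => 'REP_FILTER',
    statement_types => 'SELECT');
END;
/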
SQL, Everywhere: Futures
More Lessons from Exadata: Move Less Data → Go Faster
1. Storage indexes
   - Skip reads on irrelevant data
   - Big Hadoop blocks ~ big speed-up
2. Caching
   - Cache frequently accessed columns
   - HDFS caching
Try it out: Oracle Big Data Lite
http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
Fan-Out Parallelism Approach #1: Schema-on-Write, Small Blocks
(Diagram: RDBMS PQ servers reading across Host 1, Host 2, Host 3.)
Fan-Out Parallelism Approach #2: Schema-on-Read, Large Blocks