Oracle Big Data SQL: Architectural Deep Dive
Dan McClary, Ph.D.
Big Data Product Management, Oracle
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Oracle Confidential - Internal/Restricted/Highly Restricted
Agenda
1. The Data Analytics Challenge
2. Why Unified Query Matters
3. SQL on Hadoop and More: Unifying Metadata
4. Query Franchising: Smart Scan for Hadoop
Data Analytics Challenge
Separate silos of information to analyze
Data Analytics Challenge
Separate data access interfaces
SQL on Hadoop is Obvious
(Stinger, among other engines)
Data Analytics Challenge
No comprehensive SQL interface across Oracle, Hadoop, and NoSQL
Oracle Big Data Management System
Rich, comprehensive SQL access to all enterprise data: Oracle, Hadoop, and NoSQL
What Does Unified Query Mean for You?
Data Science. Before: a Ph.D. After: anyone.
What Does Unified Query Mean for You?
Application Development: before and after.
Use Rich Oracle SQL Dialect Over All Data
Snapshot of Oracle SQL analytic functions:
- Ranking functions: rank, dense_rank, cume_dist, percent_rank, ntile
- Window aggregate functions (moving and cumulative): avg, sum, min, max, count, variance, stddev, first_value, last_value
- LAG/LEAD functions: direct inter-row reference using offsets
- Reporting aggregate functions: sum, avg, min, max, variance, stddev, count, ratio_to_report
- Statistical aggregates: correlation, linear regression family, covariance
- Linear regression: fitting of an ordinary-least-squares regression line to a set of number pairs; frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
- Descriptive statistics: DBMS_STAT_FUNCS summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, and top/bottom 5 values
- Correlations: Pearson's correlation coefficients, Spearman's and Kendall's (both nonparametric)
- Cross tabs: enhanced with % statistics: chi-squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
- Hypothesis testing: Student t-test, F-test, binomial test, Wilcoxon signed ranks test, chi-square, Mann-Whitney test, Kolmogorov-Smirnov test, one-way ANOVA
- Distribution fitting: Kolmogorov-Smirnov test, Anderson-Darling test, chi-squared test; normal, uniform, Weibull, exponential distributions
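To make the window-function categories above concrete, here is a minimal sketch run against SQLite (3.25+) rather than Oracle; the table and data are invented, but RANK, LAG, and a cumulative SUM are the same standard-SQL idioms listed on this slide.

```python
import sqlite3

# Hypothetical "sales" table; demonstrates ranking, inter-row reference
# (LAG), and a cumulative window aggregate from the categories above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("ann", 300.0), ("bob", 150.0),
                  ("cal", 150.0), ("dee", 50.0)])

rows = conn.execute("""
    SELECT rep,
           amount,
           RANK()      OVER (ORDER BY amount DESC)           AS rnk,
           LAG(amount) OVER (ORDER BY amount DESC, rep)      AS prev_amount,
           SUM(amount) OVER (ORDER BY amount DESC, rep
                             ROWS UNBOUNDED PRECEDING)       AS running_total
    FROM sales
    ORDER BY amount DESC, rep
""").fetchall()
for r in rows:
    print(r)
```

Note how RANK leaves a gap after the tie (1, 2, 2, 4) while the running total accumulates row by row.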
Pattern Matching With Oracle SQL
Finding patterns in stock market data: a double bottom (W) in ticker prices sampled at 10:00 through 10:25.

SELECT first_x, last_z
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY name
  ORDER BY time
  MEASURES FIRST(x.time) AS first_x,
           LAST(z.time)  AS last_z
  ONE ROW PER MATCH
  PATTERN (X+ Y+ W+ Z+)
  DEFINE X AS (price < PREV(price)),
         Y AS (price > PREV(price)),
         W AS (price < PREV(price)),
         Z AS (price > PREV(price) AND z.time - FIRST(x.time) <= 7));

Simplified, sophisticated, standards-based syntax: the equivalent hand-written Java UDF runs to 250+ lines, while the SQL is 12 lines, roughly 20x less code.
Copyright 2014, Oracle and/or its affiliates. All rights reserved.
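A rough sketch of what the MATCH_RECOGNIZE query above expresses, in plain Python: classify each price move as down (D) or up (U), then search for the double-bottom shape X+ Y+ W+ Z+ as the regex D+U+D+U+. The time-window constraint (z.time - FIRST(x.time) <= 7) is omitted here for brevity, and all names are ours.

```python
import re

def find_double_bottom(prices):
    # Classify consecutive moves: D = down, U = up, F = flat.
    moves = "".join(
        "D" if b < a else "U" if b > a else "F"
        for a, b in zip(prices, prices[1:])
    )
    # The W shape of the SQL PATTERN clause, as a regex.
    m = re.search(r"D+U+D+U+", moves)
    if m is None:
        return None
    # Offsets into the move string (move i relates prices[i] -> prices[i+1]).
    return (m.start(), m.end())

print(find_double_bottom([10, 8, 9, 7, 9, 10]))   # -> (0, 5): a W shape
print(find_double_bottom([10, 11, 12]))           # -> None: prices only rise
```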
Oracle Big Data SQL: A New Architecture
- Powerful, high-performance SQL on Hadoop: full Oracle SQL capabilities on Hadoop; SQL query processing local to Hadoop nodes
- Simple data integration of Hadoop and Oracle Database: single SQL point-of-entry to access all data; scalable joins between Hadoop and RDBMS data
- Optimized hardware: balanced configurations, no bottlenecks
100% want to know what this really means.
SQL on Hadoop and More: Unifying Metadata
Why Unify Metadata?
(Diagram: CREATE TABLE customers and CREATE TABLE sales on different sources; SELECT name FROM customers; SELECT customers.name, sales.amount across the CUSTOMERS and SALES sources)
- Catalog across sources
- Integrate new metadata
- No changes for users and applications
- Seamlessly handle schema-on-read
- Exploit remote data distribution
- Holistically optimize queries
How Data is Stored in Hadoop
Example: 1 TB file of JSON click records.

Block B1:
{"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"n","activity":7}
{"custid":1083711,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
Block B2:
{"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:32","recommended":"y","activity":7}
{"custid":1010220,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:42","recommended":"y","activity":6}
{"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custid":1253676,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custid":1351777,"movieid":608,"genreid":6,"time":"2012-07-01:00:01:03","recommended":"n","activity":7}
Block B3:
{"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
{"custid":1363545,"movieid":27205,"genreid":9,"time":"2012-07-01:00:01:18","recommended":"y","activity":7}
{"custid":1067283,"movieid":1124,"genreid":9,"time":"2012-07-01:00:01:26","recommended":"y","activity":7}
{"custid":1126174,"movieid":16309,"genreid":9,"time":"2012-07-01:00:01:35","recommended":"n","activity":7}
{"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:01:39","recommended":"y","activity":7}
{"custid":1346299,"movieid":424,"genreid":1,"time":"2012-07-01:00:05:02","recommended":"y","activity":4}

1 block = 256 MB, so the example file = 4096 blocks, giving 4096 InputSplits and a potential scan parallelism of 4096.
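A back-of-the-envelope check of the numbers on this slide, assuming the stated 256 MB HDFS block size and one InputSplit per block:

```python
# 1 TB file split into 256 MB HDFS blocks, one InputSplit per block.
FILE_SIZE = 1 * 1024**4          # 1 TB in bytes
BLOCK_SIZE = 256 * 1024**2       # 256 MB in bytes

blocks = FILE_SIZE // BLOCK_SIZE
print(blocks)                    # 4096 blocks -> 4096-way potential scan parallelism
```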
How MapReduce and Hive Read Data
Scan and row creation need to work on any data format: SCAN the bytes on the Data Node disk, create rows and columns, and hand them to the consumer. Data definitions and column deserializations are needed to present a table:
- RecordReader: scans data (keys and values)
- InputFormat: defines parallelism
- SerDe: makes columns
- Metastore: maps DDL to Java access classes
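The pipeline above can be sketched in a few lines of Python. The names (record_reader, serde) mirror the Hadoop roles described on this slide but are our own toy stand-ins, not Hadoop APIs; the point is schema-on-read: raw bytes become rows only at query time.

```python
import json

def record_reader(raw_bytes):
    # RecordReader role: scan raw storage bytes into records (here: JSON lines).
    for line in raw_bytes.decode("utf-8").splitlines():
        if line.strip():
            yield line

def serde(record, columns):
    # SerDe role: deserialize one record into the requested columns.
    doc = json.loads(record)
    return tuple(doc.get(c) for c in columns)

raw = b'{"custid": 1185972, "activity": 8}\n{"custid": 1354924, "activity": 7}\n'
rows = [serde(rec, ["custid", "activity"]) for rec in record_reader(raw)]
print(rows)   # [(1185972, 8), (1354924, 7)]
```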
SQL-on-Hadoop Engines Share Metadata, not MapReduce
Hive, Impala, SparkSQL, and Oracle Big Data SQL all read the same Hive Metastore.
Hive Metastore table definitions: movieapp_log_json, tweets, avro_log
The Metastore maps DDL to Java access classes.
Extend Oracle External Tables

CREATE TABLE movielog (
  click VARCHAR2(4000))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename=logs
    com.oracle.bigdata.cluster=mycluster ))
REJECT LIMIT UNLIMITED;

- New types of external tables: ORACLE_HIVE (inherit metadata), ORACLE_HDFS (specify metadata)
- Access parameters for Big Data: Hadoop cluster, remote Hive database/table
- DBMS_HADOOP: package for automatic import
Enhance Oracle External Tables

CREATE TABLE order (
  cust_num VARCHAR2(10),
  order_num VARCHAR2(20),
  order_total NUMBER(8,2))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR )
PARALLEL 20
REJECT LIMIT UNLIMITED;

- Transparent schema-for-read: use fast C-based readers when possible, native Hadoop classes otherwise
- Engineered to understand parallelism: map external units of parallelism to Oracle
- Architected for extensibility: StorageHandler capability enables future support for other data sources (examples: MongoDB, HBase, Oracle NoSQL DB)
Query Franchising: Smart Scan for Hadoop
What Can Big Data Learn from Exadata?
Intelligent storage maximizes performance.

SELECT name, SUM(purchase)
FROM customers
GROUP BY name;

1. Oracle SQL query issued: plan constructed, query executed
2. Smart Scan works on storage: filter out unneeded rows; project only queried columns; score data models; Bloom filters to speed up joins
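The last point, Bloom filters to speed up joins, can be sketched as follows. This is a minimal, illustrative Bloom filter of our own (not Oracle's implementation): the filter is built from the join keys of the small table, then shipped to storage so most non-joining rows of the big table are discarded before they ever reach the database.

```python
import hashlib

class BloomFilter:
    # Toy Bloom filter: a bit set addressed by k independent hashes.
    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # May return a rare false positive, never a false negative.
        return all(self.bits & (1 << p) for p in self._positions(key))

# Build the filter from the small (dimension) side of the join ...
bf = BloomFilter()
for cust_id in [1185972, 1354924]:
    bf.add(cust_id)

# ... then the scan keeps only big-table rows that might join.
scanned = [1185972, 9999999, 1354924, 1111111]
survivors = [c for c in scanned if bf.might_contain(c)]
print(survivors)   # the two joining customers survive; the rest are (almost surely) dropped
```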
Query Franchising: the dispatch of query processing to self-similar compute agents on disparate systems, without loss of operational fidelity.
Big Data SQL Server: A New Hadoop Processing Engine
- Processing layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
- Resource management: YARN, cgroups
- Storage layer: filesystem (HDFS), NoSQL databases (Oracle NoSQL DB, HBase)
Smart Scan for Hadoop: Optimizing Performance
Oracle on top (Big Data SQL Server Smart Scan):
- Apply filter predicates
- Project columns
- Parse semi-structured data
Hadoop on the bottom (External Table Services on the Data Node):
- Work close to the data
- Schema-on-read with Hadoop classes
- Transformation into an Oracle data stream
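The two Smart Scan steps named above, filter predicates and column projection, can be sketched like this. The function and data are illustrative stand-ins: the point is that filtering and projection happen at the storage side, so only surviving cells travel onward.

```python
def smart_scan(rows, predicate, projected_columns):
    # Storage-side scan: filter unneeded rows, project only queried columns.
    for row in rows:
        if predicate(row):
            yield {c: row[c] for c in projected_columns}

rows = [
    {"custid": 1, "country": "Brazil", "activity": 8},
    {"custid": 2, "country": "France", "activity": 7},
    {"custid": 3, "country": "Brazil", "activity": 2},
]
out = list(smart_scan(rows, lambda r: r["country"] == "Brazil", ["custid"]))
print(out)   # [{'custid': 1}, {'custid': 3}]
```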
Big Data SQL Query Execution
How do we query Hadoop?
1. Query compilation determines data locations, data structure, and parallelism (consulting the HDFS NameNode and the Hive Metastore)
2. Fast reads using the Big Data SQL Server on each HDFS Data Node: schema-for-read using Hadoop classes; Smart Scan selects only relevant data
3. Process the filtered result: move relevant data to the database; join with database tables; apply database security policies
Mapping Hadoop to Oracle: Parallel Query and Hadoop
1. Determine Hadoop parallelism: determine schema-for-read; determine InputSplits (from the HDFS NameNode and Hive Metastore); arrange splits for best performance
2. Map to Oracle parallelism: map splits to granules; assign granules to PX servers
3. Route work: PX servers offload work to Big Data SQL Servers, then aggregate, join, and apply PL/SQL
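Step 2 above can be sketched as a simple assignment problem. The data structures and the round-robin policy here are our own assumptions for illustration; the slide only says that splits become granules that are handed to PX servers.

```python
def assign_granules(input_splits, num_px_servers):
    # Deal InputSplit-derived granules to PX servers round-robin.
    assignment = {px: [] for px in range(num_px_servers)}
    for i, split in enumerate(input_splits):
        assignment[i % num_px_servers].append(split)
    return assignment

splits = [f"split-{n}" for n in range(8)]
granules = assign_granules(splits, 3)
print(granules)
# {0: ['split-0', 'split-3', 'split-6'],
#  1: ['split-1', 'split-4', 'split-7'],
#  2: ['split-2', 'split-5']}
```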
Big Data SQL Server Dataflow
1. Read data from the HDFS Data Node disks: direct-path reads; C-based readers when possible; native Hadoop classes otherwise
2. Translate bytes to Oracle format: SerDe, RecordReader
3. Apply Smart Scan to the Oracle bytes: apply filters; project columns; parse JSON/XML; score models
But How Does Security Work?

SELECT * FROM my_bigdata_table
WHERE SALES_REP_ID = SYS_CONTEXT('USERENV','SESSION_USER');
(filter on SESSION_USER)

DBMS_REDACT.ADD_POLICY(
  object_schema       => 'MCLICK',
  object_name         => 'TWEET_V',
  column_name         => 'USERNAME',
  policy_name         => 'tweet_redaction',
  function_type       => DBMS_REDACT.PARTIAL,
  function_parameters => 'VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',
  expression          => '1=1');

- Database security for query access: Virtual Private Database, redaction, Audit Vault and Database Firewall
- Hadoop security for Hadoop jobs: Kerberos authentication, Apache Sentry (RBAC), Audit Vault
- System-specific encryption: database tablespace encryption, BDA on-disk encryption
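To see what the partial-redaction policy above does: the parameter string 'VVV...,*,3,25' masks character positions 3 through 25 with '*'. This Python stand-in mimics that behaviour for illustration only; it is not Oracle's implementation, and the sample username is invented.

```python
def partial_redact(value, mask_char="*", start=3, end=25):
    # Mimic DBMS_REDACT.PARTIAL: mask 1-based positions start..end.
    chars = list(value)
    for i in range(start - 1, min(end, len(chars))):
        chars[i] = mask_char
    return "".join(chars)

print(partial_redact("dan_mcclary"))   # -> 'da*********'
```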
SQL, Everywhere Futures
More Lessons from Exadata
Move less data, go faster:
1. Storage indexes: skip reads on irrelevant data; big Hadoop blocks mean big speed-ups
2. Caching: cache frequently accessed columns; HDFS caching
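A storage-index sketch for the first point: keep per-block min/max for a column and skip whole blocks that cannot satisfy the predicate. The block layout and API here are illustrative, not Oracle's; with 256 MB Hadoop blocks, each skipped block avoids a large read.

```python
def build_storage_index(blocks, column):
    # One (min, max) summary per block for the given column.
    return [(min(r[column] for r in b), max(r[column] for r in b))
            for b in blocks]

def scan_with_index(blocks, index, column, target):
    hits, blocks_read = [], 0
    for block, (lo, hi) in zip(blocks, index):
        if not (lo <= target <= hi):
            continue                 # skip the read: value can't be in this block
        blocks_read += 1
        hits.extend(r for r in block if r[column] == target)
    return hits, blocks_read

blocks = [
    [{"custid": 5},  {"custid": 17}],   # min 5,  max 17
    [{"custid": 40}, {"custid": 63}],   # min 40, max 63
    [{"custid": 70}, {"custid": 91}],   # min 70, max 91
]
idx = build_storage_index(blocks, "custid")
hits, reads = scan_with_index(blocks, idx, "custid", 63)
print(hits, reads)   # [{'custid': 63}] 1  (two of three blocks skipped)
```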
Oracle Big Data Management System
Rich, comprehensive SQL access to all enterprise data: Oracle, Hadoop, and NoSQL
Oracle Big Data Management System: Unite Information Lifecycles
- NoSQL: shared REST APIs, app-embedded schema
- Hadoop: shared schema
- Oracle: Automatic ILM; roll off cold partitions to Hadoop; promote hot data to Oracle
Oracle Big Data Management System: Unify All Query
Big Data SQL

SELECT w.sess_id, w.cust_id, c.name
FROM web_logs w, customers c
WHERE w.source_country = 'Brazil'
AND c.customer_id = w.cust_id;

- Relevant SQL runs on the BDA nodes (WEB_LOGS on the Hadoop cluster, 10s of gigabytes of data)
- Only the columns and rows needed to answer the query are returned to Oracle Database, where they join with CUSTOMERS