Oracle Big Data SQL: Architectural Deep Dive
Dan McClary, Ph.D.
Big Data Product Management, Oracle
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Oracle Confidential - Internal/Restricted/Highly Restricted
Agenda
1. The Data Analytics Challenge
2. Why Unified Query Matters
3. SQL on Hadoop and More: Unifying Metadata
4. Query Franchising: Smart Scan for Hadoop
Data Analytics Challenge
Separate silos of information to analyze
Data Analytics Challenge
Separate data access interfaces
SQL on Hadoop is Obvious
(Stinger, among other engines)
Data Analytics Challenge
No comprehensive SQL interface across Oracle, Hadoop, and NoSQL
Oracle Big Data Management System
Rich, comprehensive SQL access to all enterprise data: Oracle, Hadoop, and NoSQL
What Does Unified Query Mean for You?
Data Science. Before: a Ph.D. After: anyone.
What Does Unified Query Mean for You?
Application Development: before and after.
Use Rich Oracle SQL Dialect Over All Data
Snapshot of Oracle SQL analytic functions:
- Ranking functions: rank, dense_rank, cume_dist, percent_rank, ntile
- Window aggregate functions (moving and cumulative): avg, sum, min, max, count, variance, stddev, first_value, last_value
- LAG/LEAD functions: direct inter-row reference using offsets
- Reporting aggregate functions: sum, avg, min, max, variance, stddev, count, ratio_to_report
- Statistical aggregates: correlation, linear regression family, covariance
- Linear regression: fitting of an ordinary-least-squares regression line to a set of number pairs; frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
- Descriptive statistics: DBMS_STAT_FUNCS summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, and top/bottom 5 values
- Correlations: Pearson's correlation coefficients, Spearman's and Kendall's (both nonparametric)
- Cross tabs: enhanced with % statistics: chi-squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
- Hypothesis testing: Student t-test, F-test, binomial test, Wilcoxon signed ranks test, chi-square, Mann-Whitney test, Kolmogorov-Smirnov test, one-way ANOVA
- Distribution fitting: Kolmogorov-Smirnov test, Anderson-Darling test, chi-squared test; normal, uniform, Weibull, exponential distributions
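To make the window-function categories above concrete, here is a minimal sketch run against SQLite (3.25+) rather than Oracle; the table and data are invented, but RANK, LAG, and a cumulative SUM are the same standard-SQL idioms listed on this slide.

```python
import sqlite3

# Hypothetical "sales" table; demonstrates ranking, inter-row reference
# (LAG), and a cumulative window aggregate from the categories above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("ann", 300.0), ("bob", 150.0),
                  ("cal", 150.0), ("dee", 50.0)])

rows = conn.execute("""
    SELECT rep,
           amount,
           RANK()      OVER (ORDER BY amount DESC)           AS rnk,
           LAG(amount) OVER (ORDER BY amount DESC, rep)      AS prev_amount,
           SUM(amount) OVER (ORDER BY amount DESC, rep
                             ROWS UNBOUNDED PRECEDING)       AS running_total
    FROM sales
    ORDER BY amount DESC, rep
""").fetchall()
for r in rows:
    print(r)
```

Note how RANK leaves a gap after the tie (1, 2, 2, 4) while the running total accumulates row by row.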
Pattern Matching With Oracle SQL
Finding patterns in stock market data: a double bottom (W) in ticker prices sampled at 10:00 through 10:25.

SELECT first_x, last_z
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY name
  ORDER BY time
  MEASURES FIRST(x.time) AS first_x,
           LAST(z.time)  AS last_z
  ONE ROW PER MATCH
  PATTERN (X+ Y+ W+ Z+)
  DEFINE X AS (price < PREV(price)),
         Y AS (price > PREV(price)),
         W AS (price < PREV(price)),
         Z AS (price > PREV(price) AND z.time - FIRST(x.time) <= 7));

Simplified, sophisticated, standards-based syntax: the equivalent hand-written Java UDF runs to 250+ lines, while the SQL is 12 lines, roughly 20x less code.
Copyright 2014, Oracle and/or its affiliates. All rights reserved.
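A rough sketch of what the MATCH_RECOGNIZE query above expresses, in plain Python: classify each price move as down (D) or up (U), then search for the double-bottom shape X+ Y+ W+ Z+ as the regex D+U+D+U+. The time-window constraint (z.time - FIRST(x.time) <= 7) is omitted here for brevity, and all names are ours.

```python
import re

def find_double_bottom(prices):
    # Classify consecutive moves: D = down, U = up, F = flat.
    moves = "".join(
        "D" if b < a else "U" if b > a else "F"
        for a, b in zip(prices, prices[1:])
    )
    # The W shape of the SQL PATTERN clause, as a regex.
    m = re.search(r"D+U+D+U+", moves)
    if m is None:
        return None
    # Offsets into the move string (move i relates prices[i] -> prices[i+1]).
    return (m.start(), m.end())

print(find_double_bottom([10, 8, 9, 7, 9, 10]))   # -> (0, 5): a W shape
print(find_double_bottom([10, 11, 12]))           # -> None: prices only rise
```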
Oracle Big Data SQL: A New Architecture
- Powerful, high-performance SQL on Hadoop: full Oracle SQL capabilities on Hadoop; SQL query processing local to Hadoop nodes
- Simple data integration of Hadoop and Oracle Database: single SQL point-of-entry to access all data; scalable joins between Hadoop and RDBMS data
- Optimized hardware: balanced configurations, no bottlenecks
100% want to know what this really means.
SQL on Hadoop and More: Unifying Metadata
Why Unify Metadata?
(Diagram: CREATE TABLE customers and CREATE TABLE sales on different sources; SELECT name FROM customers; SELECT customers.name, sales.amount across the CUSTOMERS and SALES sources)
- Catalog across sources
- Integrate new metadata
- No changes for users and applications
- Seamlessly handle schema-on-read
- Exploit remote data distribution
- Holistically optimize queries
How Data is Stored in Hadoop
Example: 1 TB file of JSON click records.

Block B1:
{"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"n","activity":7}
{"custid":1083711,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
Block B2:
{"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:32","recommended":"y","activity":7}
{"custid":1010220,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:42","recommended":"y","activity":6}
{"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custid":1253676,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custid":1351777,"movieid":608,"genreid":6,"time":"2012-07-01:00:01:03","recommended":"n","activity":7}
Block B3:
{"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
{"custid":1363545,"movieid":27205,"genreid":9,"time":"2012-07-01:00:01:18","recommended":"y","activity":7}
{"custid":1067283,"movieid":1124,"genreid":9,"time":"2012-07-01:00:01:26","recommended":"y","activity":7}
{"custid":1126174,"movieid":16309,"genreid":9,"time":"2012-07-01:00:01:35","recommended":"n","activity":7}
{"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:01:39","recommended":"y","activity":7}
{"custid":1346299,"movieid":424,"genreid":1,"time":"2012-07-01:00:05:02","recommended":"y","activity":4}

1 block = 256 MB, so the example file = 4096 blocks, giving 4096 InputSplits and a potential scan parallelism of 4096.
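A back-of-the-envelope check of the numbers on this slide, assuming the stated 256 MB HDFS block size and one InputSplit per block:

```python
# 1 TB file split into 256 MB HDFS blocks, one InputSplit per block.
FILE_SIZE = 1 * 1024**4          # 1 TB in bytes
BLOCK_SIZE = 256 * 1024**2       # 256 MB in bytes

blocks = FILE_SIZE // BLOCK_SIZE
print(blocks)                    # 4096 blocks -> 4096-way potential scan parallelism
```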
How MapReduce and Hive Read Data
Scan and row creation need to work on any data format: SCAN the bytes on the Data Node disk, create rows and columns, and hand them to the consumer. Data definitions and column deserializations are needed to present a table:
- RecordReader: scans data (keys and values)
- InputFormat: defines parallelism
- SerDe: makes columns
- Metastore: maps DDL to Java access classes
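The pipeline above can be sketched in a few lines of Python. The names (record_reader, serde) mirror the Hadoop roles described on this slide but are our own toy stand-ins, not Hadoop APIs; the point is schema-on-read: raw bytes become rows only at query time.

```python
import json

def record_reader(raw_bytes):
    # RecordReader role: scan raw storage bytes into records (here: JSON lines).
    for line in raw_bytes.decode("utf-8").splitlines():
        if line.strip():
            yield line

def serde(record, columns):
    # SerDe role: deserialize one record into the requested columns.
    doc = json.loads(record)
    return tuple(doc.get(c) for c in columns)

raw = b'{"custid": 1185972, "activity": 8}\n{"custid": 1354924, "activity": 7}\n'
rows = [serde(rec, ["custid", "activity"]) for rec in record_reader(raw)]
print(rows)   # [(1185972, 8), (1354924, 7)]
```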
SQL-on-Hadoop Engines Share Metadata, not MapReduce
Hive, Impala, SparkSQL, and Oracle Big Data SQL all read the same Hive Metastore.
Hive Metastore table definitions: movieapp_log_json, tweets, avro_log
The Metastore maps DDL to Java access classes.
Extend Oracle External Tables

CREATE TABLE movielog (
  click VARCHAR2(4000))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename=logs
    com.oracle.bigdata.cluster=mycluster ))
REJECT LIMIT UNLIMITED;

- New types of external tables: ORACLE_HIVE (inherit metadata), ORACLE_HDFS (specify metadata)
- Access parameters for Big Data: Hadoop cluster, remote Hive database/table
- DBMS_HADOOP: package for automatic import
Enhance Oracle External Tables

CREATE TABLE order (
  cust_num VARCHAR2(10),
  order_num VARCHAR2(20),
  order_total NUMBER(8,2))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR )
PARALLEL 20
REJECT LIMIT UNLIMITED;

- Transparent schema-for-read: use fast C-based readers when possible, native Hadoop classes otherwise
- Engineered to understand parallelism: map external units of parallelism to Oracle
- Architected for extensibility: StorageHandler capability enables future support for other data sources (examples: MongoDB, HBase, Oracle NoSQL DB)
Query Franchising: Smart Scan for Hadoop
What Can Big Data Learn from Exadata?
Intelligent storage maximizes performance.

SELECT name, SUM(purchase)
FROM customers
GROUP BY name;

1. Oracle SQL query issued: plan constructed, query executed
2. Smart Scan works on storage: filter out unneeded rows; project only queried columns; score data models; Bloom filters to speed up joins
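The last point, Bloom filters to speed up joins, can be sketched as follows. This is a minimal, illustrative Bloom filter of our own (not Oracle's implementation): the filter is built from the join keys of the small table, then shipped to storage so most non-joining rows of the big table are discarded before they ever reach the database.

```python
import hashlib

class BloomFilter:
    # Toy Bloom filter: a bit set addressed by k independent hashes.
    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # May return a rare false positive, never a false negative.
        return all(self.bits & (1 << p) for p in self._positions(key))

# Build the filter from the small (dimension) side of the join ...
bf = BloomFilter()
for cust_id in [1185972, 1354924]:
    bf.add(cust_id)

# ... then the scan keeps only big-table rows that might join.
scanned = [1185972, 9999999, 1354924, 1111111]
survivors = [c for c in scanned if bf.might_contain(c)]
print(survivors)   # the two joining customers survive; the rest are (almost surely) dropped
```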
Query Franchising: the dispatch of query processing to self-similar compute agents on disparate systems, without loss of operational fidelity.
Big Data SQL Server: A New Hadoop Processing Engine
- Processing layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
- Resource management: YARN, cgroups
- Storage layer: filesystem (HDFS), NoSQL databases (Oracle NoSQL DB, HBase)
Smart Scan for Hadoop: Optimizing Performance
Oracle on top (Big Data SQL Server Smart Scan):
- Apply filter predicates
- Project columns
- Parse semi-structured data
Hadoop on the bottom (External Table Services on the Data Node):
- Work close to the data
- Schema-on-read with Hadoop classes
- Transformation into an Oracle data stream
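The two Smart Scan steps named above, filter predicates and column projection, can be sketched like this. The function and data are illustrative stand-ins: the point is that filtering and projection happen at the storage side, so only surviving cells travel onward.

```python
def smart_scan(rows, predicate, projected_columns):
    # Storage-side scan: filter unneeded rows, project only queried columns.
    for row in rows:
        if predicate(row):
            yield {c: row[c] for c in projected_columns}

rows = [
    {"custid": 1, "country": "Brazil", "activity": 8},
    {"custid": 2, "country": "France", "activity": 7},
    {"custid": 3, "country": "Brazil", "activity": 2},
]
out = list(smart_scan(rows, lambda r: r["country"] == "Brazil", ["custid"]))
print(out)   # [{'custid': 1}, {'custid': 3}]
```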
Big Data SQL Query Execution
How do we query Hadoop?
1. Query compilation determines data locations, data structure, and parallelism (consulting the HDFS NameNode and the Hive Metastore)
2. Fast reads using the Big Data SQL Server on each HDFS Data Node: schema-for-read using Hadoop classes; Smart Scan selects only relevant data
3. Process the filtered result: move relevant data to the database; join with database tables; apply database security policies
Mapping Hadoop to Oracle: Parallel Query and Hadoop
1. Determine Hadoop parallelism: determine schema-for-read; determine InputSplits (from the HDFS NameNode and Hive Metastore); arrange splits for best performance
2. Map to Oracle parallelism: map splits to granules; assign granules to PX servers
3. Route work: PX servers offload work to Big Data SQL Servers, then aggregate, join, and apply PL/SQL
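Step 2 above can be sketched as a simple assignment problem. The data structures and the round-robin policy here are our own assumptions for illustration; the slide only says that splits become granules that are handed to PX servers.

```python
def assign_granules(input_splits, num_px_servers):
    # Deal InputSplit-derived granules to PX servers round-robin.
    assignment = {px: [] for px in range(num_px_servers)}
    for i, split in enumerate(input_splits):
        assignment[i % num_px_servers].append(split)
    return assignment

splits = [f"split-{n}" for n in range(8)]
granules = assign_granules(splits, 3)
print(granules)
# {0: ['split-0', 'split-3', 'split-6'],
#  1: ['split-1', 'split-4', 'split-7'],
#  2: ['split-2', 'split-5']}
```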
Big Data SQL Server Dataflow
1. Read data from the HDFS Data Node disks: direct-path reads; C-based readers when possible; native Hadoop classes otherwise
2. Translate bytes to Oracle format: SerDe, RecordReader
3. Apply Smart Scan to the Oracle bytes: apply filters; project columns; parse JSON/XML; score models
But How Does Security Work?

SELECT * FROM my_bigdata_table
WHERE SALES_REP_ID = SYS_CONTEXT('USERENV','SESSION_USER');
(filter on SESSION_USER)

DBMS_REDACT.ADD_POLICY(
  object_schema       => 'MCLICK',
  object_name         => 'TWEET_V',
  column_name         => 'USERNAME',
  policy_name         => 'tweet_redaction',
  function_type       => DBMS_REDACT.PARTIAL,
  function_parameters => 'VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',
  expression          => '1=1');

- Database security for query access: Virtual Private Database, redaction, Audit Vault and Database Firewall
- Hadoop security for Hadoop jobs: Kerberos authentication, Apache Sentry (RBAC), Audit Vault
- System-specific encryption: database tablespace encryption, BDA on-disk encryption
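To see what the partial-redaction policy above does: the parameter string 'VVV...,*,3,25' masks character positions 3 through 25 with '*'. This Python stand-in mimics that behaviour for illustration only; it is not Oracle's implementation, and the sample username is invented.

```python
def partial_redact(value, mask_char="*", start=3, end=25):
    # Mimic DBMS_REDACT.PARTIAL: mask 1-based positions start..end.
    chars = list(value)
    for i in range(start - 1, min(end, len(chars))):
        chars[i] = mask_char
    return "".join(chars)

print(partial_redact("dan_mcclary"))   # -> 'da*********'
```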
SQL, Everywhere Futures
More Lessons from Exadata
Move less data, go faster:
1. Storage indexes: skip reads on irrelevant data; big Hadoop blocks mean big speed-ups
2. Caching: cache frequently accessed columns; HDFS caching
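A storage-index sketch for the first point: keep per-block min/max for a column and skip whole blocks that cannot satisfy the predicate. The block layout and API here are illustrative, not Oracle's; with 256 MB Hadoop blocks, each skipped block avoids a large read.

```python
def build_storage_index(blocks, column):
    # One (min, max) summary per block for the given column.
    return [(min(r[column] for r in b), max(r[column] for r in b))
            for b in blocks]

def scan_with_index(blocks, index, column, target):
    hits, blocks_read = [], 0
    for block, (lo, hi) in zip(blocks, index):
        if not (lo <= target <= hi):
            continue                 # skip the read: value can't be in this block
        blocks_read += 1
        hits.extend(r for r in block if r[column] == target)
    return hits, blocks_read

blocks = [
    [{"custid": 5},  {"custid": 17}],   # min 5,  max 17
    [{"custid": 40}, {"custid": 63}],   # min 40, max 63
    [{"custid": 70}, {"custid": 91}],   # min 70, max 91
]
idx = build_storage_index(blocks, "custid")
hits, reads = scan_with_index(blocks, idx, "custid", 63)
print(hits, reads)   # [{'custid': 63}] 1  (two of three blocks skipped)
```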
Oracle Big Data Management System
Rich, comprehensive SQL access to all enterprise data: Oracle, Hadoop, and NoSQL
Oracle Big Data Management System: Unite Information Lifecycles
- NoSQL: shared REST APIs, app-embedded schema
- Hadoop: shared schema
- Oracle: Automatic ILM; roll off cold partitions to Hadoop; promote hot data to Oracle
Oracle Big Data Management System: Unify All Query
Big Data SQL

SELECT w.sess_id, w.cust_id, c.name
FROM web_logs w, customers c
WHERE w.source_country = 'Brazil'
AND c.customer_id = w.cust_id;

- Relevant SQL runs on the BDA nodes (WEB_LOGS on the Hadoop cluster, 10s of gigabytes of data)
- Only the columns and rows needed to answer the query are returned to Oracle Database, where they join with CUSTOMERS