Oracle Big Data SQL
Konference Data a znalosti 2015 (Data and Knowledge 2015 Conference)
Jakub Illner, Information Management Architect, XLOB Enterprise Cloud Architects
23 July 2015, version 2
Agenda
1. Is SQL Dead?
2. Introducing Oracle Big Data SQL
3. How Oracle Big Data SQL Works
4. Demonstration
5. Questions and Answers
2
Is SQL Dead?
With all the Big Data and NoSQL technologies, why bother with SQL?
3
Is SQL Dead? 4
Finding Peaks and Valleys in Stock Market Data: Spark vs. SQL

The Scala/Spark version (132 lines; cleaned-up excerpt of the core logic shown on the slide) keeps per-ticker context by hand while iterating over records sorted by (uniqueId, eventTime):

  // Part 5: keeping context across rows
  var lastUniqueId = "foobar"
  var lastRecord: (DataKey, Int) = null
  var lastLastRecord: (DataKey, Int) = null
  var position = 0
  it.foreach(r => {
    position += 1
    if (!lastUniqueId.equals(r._1.uniqueId)) {
      lastRecord = null
      lastLastRecord = null
    }
    // Part 6: finding those peaks and valleys
    if (lastRecord != null && lastLastRecord != null) {
      if (lastRecord._2 < r._2 && lastRecord._2 < lastLastRecord._2) {
        results += new PivotPoint(r._1.uniqueId, position, lastRecord._1.eventTime, lastRecord._2, isPeak = false)
      } else if (lastRecord._2 > r._2 && lastRecord._2 > lastLastRecord._2) {
        results += new PivotPoint(r._1.uniqueId, position, lastRecord._1.eventTime, lastRecord._2, isPeak = true)
      }
    }
    lastUniqueId = r._1.uniqueId
    lastLastRecord = lastRecord
    lastRecord = r
  })
  // ... plus the DataKey and PivotPoint classes, custom partitioner,
  // job setup, and output formatting (Part 7) that make up the rest

The SQL version needs only 17 lines, using the LEAD/LAG window functions:

  SELECT PRIMARY_KEY, POSITION, EVENT_VALUE,
         CASE
           WHEN LEAD_EVENT_VALUE IS NULL OR LAG_EVENT_VALUE IS NULL THEN 'EDGE'
           WHEN EVENT_VALUE < LEAD_EVENT_VALUE AND EVENT_VALUE < LAG_EVENT_VALUE THEN 'VALLEY'
           WHEN EVENT_VALUE > LEAD_EVENT_VALUE AND EVENT_VALUE > LAG_EVENT_VALUE THEN 'PEAK'
           ELSE 'SLOPE'
         END AS POINT_TYPE
  FROM (
    SELECT PRIMARY_KEY, POSITION, EVENT_VALUE,
           LEAD(EVENT_VALUE, 1, NULL) OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LEAD_EVENT_VALUE,
           LAG(EVENT_VALUE, 1, NULL)  OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LAG_EVENT_VALUE
    FROM PEAK_AND_VALLEY_TABLE
  )

132 lines of Scala/Spark vs. 17 lines of SQL: 7x less code.
Example taken from Hadoop Application Architectures by Mark Grover, Ted Malaska, Jonathan Seidman & Gwen Shapira (O'Reilly, July 2015).
5
Power of SQL
- Declarative
- Abstracted (from storage)
- Concise
- Powerful
- Simple to learn
- Rich analytical functions
- Fast
- Secure
- Standardized
- Widely used & known
6
SQL on Hadoop
Hive
- First SQL engine on Hadoop
- Uses MapReduce for execution
- Contains metastore (HCatalog)
- New Hive-on-Spark project
Impala
- Developed by Cloudera
- Fast, in-memory execution
- Introduced Parquet format
- Compatible with Hive
SparkSQL
- Spark module for accessing structured data
- Fast, in-memory execution
- Compatible with Hive
Presto
- Developed by Facebook
- Fast, low-latency execution
- Compatible with Hive
- Connectivity to other sources
7
What if you need to query Hadoop & RDBMS data?
- Pure SQL-on-Hadoop engines can access only data in Hadoop (HDFS, Hive, Parquet, ORC, HBase, etc.)
- Performance of BI tools like Cognos, Oracle BIEE, SAS, and Tableau is limited for large and complex federated queries
- A possible solution is to use the SQL interface and Hadoop integration available with the DBMS platform of choice
8
Which SQL-on-Hadoop approach will you use most?
In Gartner's poll, the DBMS providers' SQL interfaces grew from 9% in 2014 to 19% in 2015.
http://blogs.gartner.com/merv-adrian/2015/02/22/which-sql-on-hadoop-poll-still-says-whatever-but-dbms-providers-gain/
9
Introducing Oracle Big Data SQL One SQL to query ALL the data 10
What if you could...
- Make all data easily available to all your Oracle Database applications
- While supporting the full breadth of the Oracle SQL query language
- With all the security of Oracle Database 12c
- Without moving data between your Hadoop cluster and the RDBMS
- And deliver fast query performance
- While leveraging your existing skills
- And still utilize the latest Hadoop innovations
11
One SQL to query ALL the data
(word cloud of query languages: SQL, HiveQL, CQL, N1QL, UnQL, NoSQL)
12
Rich Analytical Functions with Oracle SQL
- Ranking functions: rank, dense_rank, cume_dist, percent_rank, ntile
- Window aggregate functions (moving and cumulative): avg, sum, min, max, count, variance, stddev, first_value, last_value
- LAG/LEAD functions: direct inter-row reference using offsets (see the sketch after this slide)
- Reporting aggregate functions: sum, avg, min, max, variance, stddev, count, ratio_to_report
- Statistical aggregates: correlation, linear regression family, covariance
- Linear regression: fitting of an ordinary-least-squares regression line to a set of number pairs; frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
- Descriptive statistics: DBMS_STAT_FUNCS summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values
- Correlations: Pearson's correlation coefficients, Spearman's and Kendall's (both nonparametric)
- Cross tabs: enhanced with % statistics: chi-squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
- Hypothesis testing: Student t-test, F-test, binomial test, Wilcoxon signed ranks test, chi-square, Mann-Whitney test, Kolmogorov-Smirnov test, one-way ANOVA
- Distribution fitting: Kolmogorov-Smirnov test, Anderson-Darling test, chi-squared test; normal, uniform, Weibull, exponential distributions
13
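A minimal sketch of the window-function style listed above, assuming a hypothetical STOCK_PRICES table (ticker, trade_time, price); LAG reaches back one row within each ticker partition without a self-join, which is what keeps such queries concise:

  SELECT ticker,
         trade_time,
         price,
         price - LAG(price, 1) OVER (PARTITION BY ticker ORDER BY trade_time) AS price_change,
         RANK() OVER (PARTITION BY ticker ORDER BY price DESC)                AS price_rank
  FROM   stock_prices;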
Pattern Matching with Oracle SQL
Finding Patterns in Stock Market Data: Double Bottom (W)

The slide contrasts a 250+ line Java UDF (cleaned-up excerpt of the state machine shown) that tracks start/top/bottom/end states by hand:

  next = nextLine.getQuantity();
  if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) { state = "S"; return state; }
  if (gt(q, prev) && gt(q, next)) { state = "T"; return state; }
  if (lt(q, prev) && lt(q, next)) { state = "B"; return state; }
  if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) { state = "E"; return state; }
  if (q.isEmpty() || eq(q, prev)) { state = "F"; return state; }
  return state;
  // ... plus the eq/gt/lt comparison helpers, FIFO line buffering,
  // and Pig DataBag plumbing that make up the rest of the UDF

with the 12-line Oracle SQL MATCH_RECOGNIZE equivalent:

  SELECT first_x, last_z
  FROM ticker MATCH_RECOGNIZE (
    PARTITION BY name
    ORDER BY time
    MEASURES FIRST(x.time) AS first_x, LAST(z.time) AS last_z
    ONE ROW PER MATCH
    PATTERN (X+ Y+ W+ Z+)
    DEFINE
      X AS (price < PREV(price)),
      Y AS (price > PREV(price)),
      W AS (price < PREV(price)),
      Z AS (price > PREV(price) AND z.time - FIRST(x.time) <= 7))

250+ lines of Java UDF vs. 12 lines of Oracle SQL: 20x less code.
14
Security: Virtual Private Database with Oracle SQL

  SELECT * FROM my_bigdata_table
  WHERE SALES_REP_ID = SYS_CONTEXT('USERENV','SESSION_USER');

Oracle Virtual Private Database (VPD) enables you to create security policies that control database access at the row and column level. VPD adds a dynamic WHERE clause to any SQL statement issued against the table, view, or synonym to which a VPD security policy was applied. Because the security policies are attached directly to the database objects and applied automatically whenever a user accesses the data, there is no way to bypass security. With Big Data SQL, Oracle Virtual Private Database is available for Hadoop data: the filter on SESSION_USER is evaluated by Big Data SQL on the Hadoop cluster.
15
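A hedged sketch of how such a policy might be attached, with hypothetical schema and object names (DBMS_RLS is the standard VPD package; the policy function returns the predicate that VPD appends to queries):

  -- Hypothetical policy function returning the dynamic WHERE predicate
  CREATE OR REPLACE FUNCTION sales_rep_policy (
    p_schema IN VARCHAR2,
    p_object IN VARCHAR2
  ) RETURN VARCHAR2 AS
  BEGIN
    RETURN 'SALES_REP_ID = SYS_CONTEXT(''USERENV'',''SESSION_USER'')';
  END;
  /

  -- Attach the policy; it then also governs the Big Data SQL external table
  BEGIN
    DBMS_RLS.ADD_POLICY(
      object_schema   => 'MCLICK',
      object_name     => 'MY_BIGDATA_TABLE',
      policy_name     => 'sales_rep_rows',
      function_schema => 'MCLICK',
      policy_function => 'SALES_REP_POLICY',
      statement_types => 'SELECT');
  END;
  /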
Security: Data Redaction with Oracle SQL

  DBMS_REDACT.ADD_POLICY(
    object_schema       => 'MCLICK',
    object_name         => 'TWEET_V',
    column_name         => 'USERNAME',
    policy_name         => 'tweet_redaction',
    function_type       => DBMS_REDACT.PARTIAL,
    function_parameters => 'VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',
    expression          => '1=1');

Oracle Data Redaction enables you to create security policies that control what data is visible in sensitive columns containing personal or security information. Data Redaction dynamically applies a redaction function to columns; the function transforms, obfuscates, or hides the sensitive information for unauthorized users. Since the policy is applied automatically by Oracle Database, there is no way to bypass security and get the unredacted data. With Big Data SQL, Oracle Data Redaction is available for Hadoop data.
16
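Illustratively (the usernames below are made up), the partial policy above keeps the first two characters visible and masks positions 3 through 25 with '*':

  SELECT username FROM mclick.tweet_v WHERE ROWNUM <= 2;
  -- USERNAME
  -- -------------------------
  -- jo***********************
  -- an***********************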
How Oracle Big Data SQL Works Marriage of Hadoop and Oracle Database Query Processing 17
Hadoop data accessible through Oracle External Tables

  CREATE TABLE movielog (click VARCHAR2(4000))
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_HIVE
    DEFAULT DIRECTORY Dir1
    ACCESS PARAMETERS (
      com.oracle.bigdata.tablename=logs
      com.oracle.bigdata.cluster=mycluster))
  REJECT LIMIT UNLIMITED;

- New ORACLE_HIVE and ORACLE_HDFS access drivers with a new set of access parameters that identify a Hadoop cluster, data source, column mapping, error handling, overflow handling, and logging
- Table metadata is passed from the Oracle DDL to the Hadoop readers at query execution
- Architected for extensibility: the StorageHandler capability enables future support for many other data sources, for example MongoDB, HBase, Oracle NoSQL DB
An ORACLE_HDFS counterpart is sketched after this slide.
18
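For comparison, a hedged sketch of the ORACLE_HDFS driver mentioned above, which maps files in an HDFS path directly without going through Hive metadata (the path and directory name are assumptions for illustration):

  CREATE TABLE movielog_hdfs (click VARCHAR2(4000))
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_HDFS
    DEFAULT DIRECTORY DEFAULT_DIR
    LOCATION ('/user/oracle/moviework/applog_json/'))
  REJECT LIMIT UNLIMITED;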
How Oracle executes a query with Big Data SQL
1. Query compilation determines data locations, data structure, and parallelism (using the HDFS NameNode and the Hive Metastore)
2. Fast reads using the Big Data SQL Server on each HDFS DataNode: schema-for-read using Hadoop classes; Smart Scan selects only the relevant data
3. The filtered result is processed in Oracle Database 12c: only the relevant data moves to the database, where it is joined with database tables and the database security policies are applied
19
Big Data SQL: A New Hadoop Processing Engine
- Processing layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
- Resource management: YARN, cgroups
- Storage layer: filesystem (HDFS), NoSQL databases (Oracle NoSQL DB, HBase)
20
Big Data SQL Uses the Hive Metastore, not MapReduce
- The Hive Metastore is the common semantic repository (schemas, Java classes) for most SQL-on-Hadoop tools: Oracle Big Data SQL, SparkSQL, Hive, Impala
- The Metastore maps DDL to Java access classes
21
How Data is Stored in Hadoop: Example of a 1 TB JSON File

  {"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
  {"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"n","activity":7}
  {"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:32","recommended":"y","activity":7}
  ...

- Records are stored one JSON document per line, split across HDFS blocks (B1, B2, B3, ...)
- 1 block = 256 MB, so the example 1 TB file = 4096 blocks
- 4096 blocks = 4096 InputSplits = the potential scan parallelism
22
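A hedged sketch of querying this clickstream through the movielog external table defined earlier, using the SQL/JSON functions available in Oracle Database 12c (the analysis itself is illustrative, not from the slide):

  SELECT JSON_VALUE(click, '$.custid')  AS custid,
         JSON_VALUE(click, '$.movieid') AS movieid,
         COUNT(*)                       AS clicks
  FROM   movielog
  WHERE  JSON_VALUE(click, '$.recommended') = 'y'
  GROUP  BY JSON_VALUE(click, '$.custid'), JSON_VALUE(click, '$.movieid');

With Big Data SQL, the JSON parsing and the WHERE filter are pushed down to the Hadoop cluster, as the following slides explain.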
Hive Storage Handler: How MapReduce and Hive Read Data
The consumer needs rows and columns, but the scan must work on any data format, so data definitions and column deserializations are needed to turn files on the DataNode disks into a table:
- RecordReader: scans data (keys and values)
- InputFormat: defines parallelism
- SerDe: makes columns
- Metastore: maps DDL to the Java access classes above
A HiveQL sketch follows this slide.
23
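A hedged HiveQL sketch (the location path is an assumption) showing how the Metastore records which Java classes any engine, Big Data SQL included, should use to scan the data:

  CREATE EXTERNAL TABLE logs (click STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/user/oracle/moviework/applog_json/';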
Big Data SQL Server Dataflow
1. Read data from the HDFS DataNode disks: direct-path reads, using C-based readers when possible and the native Hadoop classes (RecordReader, SerDe) otherwise
2. Translate the bytes to Oracle format (External Table Services)
3. Apply Smart Scan to the Oracle bytes: apply filters, project columns, parse JSON/XML, score models
24
Operations Pushed Down to Hadoop
Pushed down to the Big Data SQL cell on the Hadoop cluster:
- Hadoop scans (InputFormat, SerDe)
- JSON parsing
- WHERE clause evaluation
- Storage index evaluation
- Column projection
- Bloom filters for faster joins
- Scoring of Data Mining models
Smart Scan: only the relevant data are emitted to the database.
Handled by Oracle Database 12c:
- Query compilation & optimization
- Joins
- Aggregations
- Ordering of results
- PL/SQL evaluation
- Table functions
- Security policies
25
Oracle Big Data SQL Storage Index
- The storage index provides query speed-up through transparent I/O elimination of HDFS blocks
- Columns in SQL are mapped to fields in the HDFS file via the external table definition
- The min/max value of each field is recorded for each HDFS block (256 MB) in an in-memory storage index
- Example: find all ratings for movies with a MOVIE_ID of 1109; given index entries B1 (Movie_ID min 1001, max 1609) and B2 (Movie_ID min 1909, max 13010), block B2 can be skipped entirely because 1109 lies outside its min/max range (see the sketch after this slide)
26
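A hedged illustration of the idea; the MOVIE_RATINGS external table and its column names are hypothetical:

  SELECT COUNT(*)
  FROM   movie_ratings
  WHERE  movie_id = 1109;
  -- Blocks whose recorded [min, max] range for movie_id does not
  -- contain 1109 (e.g. B2 with min 1909) are never read from disk.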
Oracle Parallel Query and Hadoop
1. Determine Hadoop parallelism: determine schema-for-read, determine InputSplits, arrange splits for best performance (using the HDFS NameNode and the Hive Metastore)
2. Map to Oracle parallelism: map splits to granules, assign batches of granules to PX Servers
3. Route work: send granule requests asynchronously to the cells, reap results
27
Oracle and Hadoop Parallelism
On the Hadoop side (Big Data SQL cells on Hosts 1-4), each PX Server fans out asynchronous granule requests:
- Parallelism is defined by the Hadoop InputSplits (fan-out from PX Servers to Hadoop)
- As many cores are utilized as provided by cgroups (first-come-first-served)
On the Oracle Database 12c side (PX Servers #1-#8):
- Parallelism is defined by the Degree of Parallelism (DOP): dynamic, statement-level, or table-level
- The DOP can be throttled by the database if the maximum DOP is exceeded or the table is too small

  SELECT /*+ PARALLEL(EVE,8) */
         CUST.NAME, CUST.MSISDN, EVE.MONTH, EVE.EVENT_TYPE,
         COUNT(*) AS EVENT_COUNT,
         SUM(EVE.DURATION) AS DURATION
  FROM   D_CUSTOMERS CUST, F_NETWORK_EVENTS EVE
  WHERE  CUST.MSISDN = EVE.MSISDN
  GROUP  BY CUST.NAME, CUST.MSISDN, EVE.MONTH, EVE.EVENT_TYPE
  ORDER  BY 1, 2, 3, 4;
28
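Besides the statement-level hint above, the DOP can also be set at table level or forced for a session; a brief sketch reusing the table name from the example:

  ALTER TABLE f_network_events PARALLEL 8;       -- table-level default DOP
  ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8; -- session-wide DOP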
Big Data SQL Prerequisites
- Oracle Database 12c on Linux, running on Oracle Exadata
- Oracle Big Data Appliance with CDH
- InfiniBand interconnect between Oracle Exadata and Oracle Big Data Appliance
29
Demonstration Big Data SQL in Action 30
(Slides 31-35: demonstration screenshots)
Questions and Answers
(provided we still have some time)
36
Knowledge Check: True or False?
"Hive is leveraged by Big Data SQL as a query execution engine, allowing Big Data SQL queries to automatically execute faster as the Hive execution engine improves (e.g. Spark replaces MapReduce)."
False. Big Data SQL leverages only the Hive Metastore (HCatalog) and the corresponding classes (InputFormat, RecordReader, SerDe); it does not use Hive for execution.
37
Knowledge Check: True or False?
"Oracle Big Data SQL sends all the data from Hadoop to Oracle Database, where the query is processed; Oracle Database does the column selection and applies the WHERE, GROUP BY, and ORDER BY clauses, etc."
False. Big Data SQL Smart Scan performs the low-level processing of the query on the Hadoop cluster: it does the column projection, applies WHERE conditions and Bloom filters, processes JSON, etc.
38
Knowledge Check: True or False?
"Oracle's ambition with Big Data SQL is to supersede all the SQL-on-Hadoop engines like Hive, SparkSQL, Impala, Drill, or Presto."
False. Big Data SQL is for companies with significant Oracle assets (e.g. an Oracle data warehouse) that want to access and process both Hadoop and Oracle data from a single SQL environment.
39
Thank You! 40
Safe Harbor Statement
The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
41