Seamless Access from Oracle Database to Your Big Data
Brian Macdonald
Big Data and Analytics Specialist
Oracle Enterprise Architect
September 24, 2015
Agenda
  - Hadoop and SQL access methods
  - What is Oracle Big Data SQL
  - Big Data SQL Architecture
  - Big Data SQL Configuration
  - Roadmap
  - Customer Story
  - Q&A
First, Let's Define Big Data, Structured Data, and Unstructured Data
SQL on Hadoop is Obvious, Although Implementations Vary
  - Hive
  - Impala, HAWQ, IBM Big SQL
  - Oracle SQL Connector for HDFS (OSCH)
  - Oracle Big Data SQL
  - A million more (Tez, Presto, Hadapt, Stinger, Polybase, Drill, lots of start-ups)
SQL Analytics Challenge
  Separate silos of information to analyze
SQL Analytics Challenge
  No comprehensive SQL interface
Oracle Big Data SQL
  Hadoop + NoSQL + Relational
Oracle Big Data SQL: A New Architecture
  Powerful, high-performance SQL on Hadoop
    - Full Oracle SQL capabilities on Hadoop
    - SQL query processing local to Hadoop nodes
  Simple data integration of Hadoop and Oracle Database
    - Single SQL point-of-entry to access all data
    - Scalable joins between Hadoop and RDBMS data
  Oracle Security
    - Govern all data through a single set of security policies
    - Redaction, VPD, etc.
  Tool Access
Use Rich Oracle SQL Dialect Over All Data
Snapshot of Oracle SQL Analytic Functions
  Ranking functions
    rank, dense_rank, cume_dist, percent_rank, ntile
  Window Aggregate functions (moving and cumulative)
    avg, sum, min, max, count, variance, stddev, first_value, last_value
  LAG/LEAD functions
    Direct inter-row reference using offsets
  Reporting Aggregate functions
    sum, avg, min, max, variance, stddev, count, ratio_to_report
  Statistical Aggregates
    Correlation, linear regression family, covariance
  Linear regression
    Fitting of an ordinary-least-squares regression line to a set of number pairs. Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions.
  Descriptive Statistics
    DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values
  Correlations
    Pearson's correlation coefficients, Spearman's and Kendall's (both nonparametric)
  Cross Tabs
    Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
  Hypothesis Testing
    Student t-test, F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann-Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
  Distribution Fitting
    Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential
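To make the ranking and LAG entries concrete, here is a small sketch in plain Python of what RANK, DENSE_RANK, and LAG compute. It illustrates the semantics only and is not Oracle code. The sample values and helper names are hypothetical, and the ranking assumes a descending sort order.

```python
# Conceptual illustration only: plain Python, not Oracle SQL.
# Rankings assume a descending sort, as with ORDER BY value DESC.

def rank(values):
    """RANK-style: ties share a rank, and the next rank skips ahead."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def dense_rank(values):
    """DENSE_RANK-style: ties share a rank, with no gaps afterwards."""
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in values]

def lag(values, offset=1, default=None):
    """LAG-style: the value `offset` rows earlier, or `default` if none."""
    return [values[i - offset] if i >= offset else default
            for i in range(len(values))]

sales = [300, 500, 500, 100]   # hypothetical sample values
print(rank(sales))             # [3, 1, 1, 4]
print(dense_rank(sales))       # [2, 1, 1, 3]
print(lag(sales))              # [None, 300, 500, 500]
```

The tie at 500 shows the difference: RANK jumps from 1 to 3, while DENSE_RANK continues with 2.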
Oracle Big Data SQL Architecture
Two components of Oracle Big Data SQL
  - External Table extension
  - Big Data SQL Server Software on Big Data Appliance
A Smarter Oracle External Table
  You define (Oracle table):
    - Table name
    - Oracle types
    - Any Degree of Parallelism
  You get (HDFS data):
    - Automatic discovery of Hive table metadata
    - Automatic translation from Hadoop types
    - Automatic conversion from any InputFormat
    - Fan-out parallelism across the Hadoop cluster
Unify Metadata: Publish Hive Metadata to Oracle Catalog
    CREATE TABLE movieapp_log_json
      (click VARCHAR2(4000))
      ORGANIZATION EXTERNAL
      (TYPE ORACLE_HIVE
       DEFAULT DIRECTORY DEFAULT_DIR
      )
      REJECT LIMIT UNLIMITED;
  [Diagram labels: Hive Metastore, Hive metadata, Oracle Catalog, External Table, Big Data Appliance + Hadoop/NoSQL, Exadata + Oracle Database]
Accessible through Oracle Data Dictionary Immediately
So the DBA doesn't need to go to Hadoop
  - ALL_HIVE_DATABASES, ALL_HIVE_TABLES, ALL_HIVE_COLUMNS
  - DBA_HIVE_DATABASES, DBA_HIVE_TABLES, DBA_HIVE_COLUMNS
  - USER_HIVE_DATABASES, USER_HIVE_TABLES, USER_HIVE_COLUMNS
Extend Oracle External Tables
    CREATE TABLE movielog (
      click VARCHAR2(4000))
      ORGANIZATION EXTERNAL (
        TYPE ORACLE_HIVE
        DEFAULT DIRECTORY DEFAULT_DIR
        ACCESS PARAMETERS (
          com.oracle.bigdata.tablename=logs
          com.oracle.bigdata.cluster=mycluster
        ))
      REJECT LIMIT UNLIMITED;
  New types of external tables
    - ORACLE_HIVE (inherit metadata)
    - ORACLE_HDFS (specify metadata)
  Access parameters for Big Data
    - Hadoop cluster
    - Remote Hive database/table
  DBMS_HADOOP package for automatic import
  SQLDeveloper Integration (Create Table)
SQLDeveloper Integration
How Data is Stored in Hadoop: As Files. Pretty Simple
  Example: 1TB file
    {"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
    {"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"n","activity":7}
    {"custid":1083711,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
    {"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:32","recommended":"y","activity":7}
    {"custid":1010220,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:42","recommended":"y","activity":6}
    {"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
    {"custid":1253676,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
    {"custid":1351777,"movieid":608,"genreid":6,"time":"2012-07-01:00:01:03","recommended":"n","activity":7}
    {"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
    {"custid":1363545,"movieid":27205,"genreid":9,"time":"2012-07-01:00:01:18","recommended":"y","activity":7}
    {"custid":1067283,"movieid":1124,"genreid":9,"time":"2012-07-01:00:01:26","recommended":"y","activity":7}
    {"custid":1126174,"movieid":16309,"genreid":9,"time":"2012-07-01:00:01:35","recommended":"n","activity":7}
    {"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:01:39","recommended":"y","activity":7}
    {"custid":1346299,"movieid":424,"genreid":1,"time":"2012-07-01:00:05:02","recommended":"y","activity":4}
  External table over the file (ORDER is a reserved word in Oracle, so the table is named ORDERS here):
    CREATE TABLE ORDERS
      (custid      VARCHAR2(10),
       recommended VARCHAR2(20),
       activity    NUMBER(8,2))
      ORGANIZATION EXTERNAL
      (TYPE oracle_hdfs
       LOCATION ('hdfs:/usr/cust/summary/*'));
  Assumes default values
  Table options
    - Fields
    - Column maps
    - Delimiters
    - File formats: json, textfile, sequencefile
    - SerDes (e.g., regex)
    - More (see docs)
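The idea that a file of JSON records becomes rows and columns only when it is read can be sketched in a few lines of Python. This is a conceptual illustration, not how Big Data SQL is implemented. The two records are copied from the sample above, and the helper name `project` is hypothetical.

```python
import json

# Two records copied from the sample file above.
raw_lines = [
    '{"custid":1185972,"movieid":null,"genreid":null,'
    '"time":"2012-07-01:00:00:07","recommended":null,"activity":8}',
    '{"custid":1354924,"movieid":1948,"genreid":9,'
    '"time":"2012-07-01:00:00:22","recommended":"n","activity":7}',
]

def project(line, columns):
    """Parse one JSON record and keep only the requested columns."""
    record = json.loads(line)
    return tuple(record.get(col) for col in columns)

# The "table definition": which fields to expose as columns.
columns = ("custid", "recommended", "activity")
rows = [project(line, columns) for line in raw_lines]
for row in rows:
    print(row)
```

The file itself never changes. A different column list over the same lines would yield a different table, which is the point of applying the schema at read time.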
Creating an External Table against Hive
    DBMS_HADOOP.CREATE_EXTDDL_FOR_HIVE (
      cluster_id      IN  VARCHAR2,
      db_name         IN  VARCHAR2 := NULL,
      hive_table_name IN  VARCHAR2,
      hive_partition  IN  BOOLEAN,
      table_name      IN  VARCHAR2 := NULL,
      perform_ddl     IN  BOOLEAN DEFAULT FALSE,
      text_of_ddl     OUT VARCHAR2
    );

    set serveroutput on
    DECLARE
      DDLout VARCHAR2(4000);
    BEGIN
      dbms_hadoop.create_extddl_for_hive(
        CLUSTER_ID      => 'bigdatalite',
        DB_NAME         => 'brian',
        HIVE_TABLE_NAME => 'movie',
        HIVE_PARTITION  => FALSE,
        TABLE_NAME      => 'movie',
        PERFORM_DDL     => FALSE,
        TEXT_OF_DDL     => DDLout);
      dbms_output.put_line(DDLout);
    END;
    /
Oracle External Tables: Flexibility for Varied File Structures
    CREATE TABLE ORDERS (
      cust_num    VARCHAR2(10),
      order_num   VARCHAR2(20),
      order_total NUMBER(8,2))
      ORGANIZATION EXTERNAL (
        TYPE ORACLE_HIVE
        DEFAULT DIRECTORY DEFAULT_DIR
      )
      PARALLEL 20
      REJECT LIMIT UNLIMITED;
  (ORDER is a reserved word in Oracle, so the table is named ORDERS here.)
  Transparent schema-for-read
    - Use fast C-based readers when possible
    - Use native Hadoop classes otherwise
  Engineered to understand parallelism
    - Map external units of parallelism to Oracle
  Architected for extensibility
    - StorageHandler capability enables support for other data sources
    - Examples: MongoDB, HBase, Oracle NoSQL DB
StorageHandlers: Extensibility Beyond HDFS
  StorageHandlers are a metadata bridge.
  [Diagram labels: Oracle Big Data SQL, Hive Metastore]
Oracle Big Data SQL Architecture
Two components of Oracle Big Data SQL
  - External Table extension
  - Big Data SQL Server Software on Big Data Appliance
What gives Exadata extreme performance?
  - Offload query to Exadata Storage Servers
  - Small data subset quickly returned
  [Diagram labels: SQL, Oracle Database 12c]
Introducing Oracle Big Data SQL
Massively Parallel SQL Query across Oracle, Hadoop and NoSQL
  [Diagram labels: Hadoop & NoSQL, Oracle Database 12c]
Big Data Appliance X5-2
  Sun Oracle X5-2L Servers with per server:
    - 2 x 18-core Intel Xeon E5 processors
    - 128 GB memory
    - 96 TB disk space
  Integrated Software (4.2):
    - Oracle Linux 6.6
    - Oracle Big Data SQL 1.1*
    - Cloudera Distribution of Apache Hadoop 5.4, EDH Edition
    - Cloudera Manager 5.4
    - Oracle R Distribution
    - Oracle NoSQL Database CE
  * Oracle Big Data SQL is separately licensed
Introducing Oracle Big Data SQL
Massively Parallel SQL Query across Oracle, Hadoop and NoSQL
  - Hadoop & NoSQL: offload query to data nodes; data subset returned
  - Oracle Database 12c: offload query to Exadata Storage Servers; small data subset quickly returned
Big Data SQL Server: A New Hadoop Processing Engine
  Processing Layer
    - MapReduce and Hive
    - Spark
    - Impala
    - Search
    - Big Data SQL
  Resource Management (YARN, cgroups)
  Storage Layer
    - Filesystem (HDFS)
    - NoSQL Databases (Oracle NoSQL DB, HBase)
Big Data SQL Query Execution: How do we query Hadoop?
  1. Query compilation determines (HDFS NameNode, Hive Metastore):
       - Data locations
       - Data structure
       - Parallelism
  2. Fast reads using Big Data SQL Server (BDSQL on each HDFS Data Node):
       - Schema-for-read using Hadoop classes
       - Smart Scan selects only relevant data
  3. Process filtered result:
       - Move relevant data to database
       - Join with database tables
       - Apply database security policies
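The three execution steps can be pictured with a small simulation in plain Python. This is a sketch under stated assumptions, not the product's implementation: the node names, the customer last names, the predicate, and the helpers `scan_node` and `run_query` are all hypothetical. The sample rows reuse custid, movieid, and activity values from the JSON example earlier in the deck.

```python
# Conceptual simulation in plain Python; all names and data are hypothetical.

# Rows held by two Hadoop data nodes: (custid, movieid, activity).
data_nodes = {
    "node1": [(1185972, None, 8), (1354924, 1948, 7)],
    "node2": [(1234182, 11547, 7), (1010220, 11547, 6)],
}
# A small customer table held in the database: custid -> last_name.
customers = {1354924: "Smith", 1234182: "Jones"}

def scan_node(rows, predicate, columns):
    """Step 2: filter rows and project columns on the data node itself."""
    return [tuple(row[c] for c in columns) for row in rows if predicate(row)]

def run_query():
    # Step 1: "compilation" decides which nodes hold the data.
    targets = list(data_nodes)
    # Step 2: each node returns only matching rows and needed columns.
    filtered = []
    for node in targets:
        filtered += scan_node(data_nodes[node],
                              predicate=lambda r: r[2] == 7,
                              columns=(0, 1))
    # Step 3: the database joins the reduced result with its own table.
    return [(customers[cid], movie) for cid, movie in filtered
            if cid in customers]

result = run_query()
print(result)   # [('Smith', 1948), ('Jones', 11547)]
```

Only two of the four rows, and two of the three columns, leave the data nodes. That reduction before data movement is the effect the slide attributes to Smart Scan.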
Apply Advanced Security on Hadoop & NoSQL
Same security policies across all data
  - Redaction
  - Virtual Private Database
  - Fine-grained Access Control
    DBMS_REDACT.ADD_POLICY(
      object_schema => 'sales',
      object_name   => 'customer_detail',
      column_name   => 'last_name',
      policy_name   => 'customer_privacy',
      function_type => DBMS_REDACT.FULL,
      expression    => '1=1'
    );
  [Diagram labels: SQL, raw JSON data in Hadoop, customer data in Oracle Database 12c, redacted data subset]
Configuration
  - Install Oracle Big Data SQL on the BDA using Mammoth
  - Run the Big Data SQL-Exadata installation script on each Oracle Exadata database node. The script:
      - Sets up connectivity from Exadata to the Big Data SQL Servers on the BDA
      - Installs a Hadoop client
      - Configures directories and files: Big Data SQL Agent, Oracle directory objects, others
Directories
Two types of directories are created
  Common Directory
    - Must be on a cluster-wide shared file system
    - Subdirectories for jar files
    - bigdata.properties (paths, etc.)
  Cluster Directory(s)
    - Configuration details for each BDA cluster
    - Subdirectory of the Common Directory
  Oracle directory objects that point to these directories
    - ORACLE_BIGDATA_CONFIG: Common Directory
    - ORACLE_BIGDATA_CL_XXXX: one for each Cluster Directory (case sensitive)
Big Data SQL Agents Created by Install Script
  This multi-threaded agent bridges the metadata between Oracle Database and Hadoop. It launches a single JVM instead of one for every process (which can be quite slow).
    create public database link BDSQL$_XXXX using 'extproc_connection_data';
  (XXXX is the name of each BDA cluster from the Cluster Directories.)
    create public database link BDSQL$_DEFAULT_CLUSTER using 'extproc_connection_data';
If Kerberos is used on BDA
  - Must create a ticket (kinit) for the BDS user
  - BDS runs as the Oracle user
  - Tickets need to be renewed (for example, with cron)
  - Other automation to be released soon
Requirements - For Now
  - Exadata
  - Oracle 12.1.0.2.1+
  - Storage Servers 2.1.1.1 or 12.1.1.0
  - Exadata configured on the same InfiniBand subnet as BDA
  - Exadata and BDA connected by InfiniBand
Roadmap
  Subsequent content subject to change!
Enhanced Parallelism
  Today
    - Hadoop DoP linked to RDBMS DoP
    - Leads to many idle PQ processes
    - Requires explicit declaration
  Next
    - Unlink Hadoop and RDBMS DoP
    - Automatic max Hadoop parallelism, even on serial tables
    - An average of 40% faster, even at equivalent DoP
Storage Indexing
  Today
    - All blocks in a query must be read from disk
    - Large (256MB) disk I/O for each block
  Next
    - Automatically create Storage Indexes in Big Data SQL Agents
    - Check index before reading blocks
    - Skip unnecessary I/Os
    - An average of 65% faster
    - Up to 100x faster for highly selective queries
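A minimal Python sketch of "check index before reading blocks" follows. It assumes the index keeps a minimum and maximum value per block, which is how Exadata storage indexes are commonly described. The slide does not spell out the mechanism, so treat that as an assumption. The block contents and the name `scan_equal` are hypothetical.

```python
# Conceptual sketch in plain Python; block layout and values are hypothetical.
# Assumption: the index stores one (min, max) pair per block.

blocks = [
    [3, 8, 5],      # block 0
    [12, 15, 11],   # block 1
    [21, 27, 24],   # block 2
]

# Build the index once: the minimum and maximum value seen in each block.
storage_index = [(min(b), max(b)) for b in blocks]

def scan_equal(target):
    """Return matching values and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block, (lo, hi) in zip(blocks, storage_index):
        if target < lo or target > hi:
            continue            # index proves no match: skip the I/O
        blocks_read += 1
        matches += [v for v in block if v == target]
    return matches, blocks_read

print(scan_equal(15))   # ([15], 1)
```

Here one block is read instead of three. The saving grows with selectivity, which is consistent with the slide's claim that highly selective queries benefit most.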
Customer Examples
Building Customer Loyalty
  Company Overview
    - Customer loyalty marketing and programs for major retailers and consumer brands
  Challenges
    - Deliver personalized multi-channel content to every customer (example: Kroger's MyMagazine)
    - Expand to a wide variety of interaction data to build customer profiles
  Benefits
    - 2x improvements in campaign performance
    - Large-scale concurrent processing of complex SQL
    - 70% of analysis is done in SQL, uses R as well
  Solution Overview
    - Oracle Exadata X3-8
    - Oracle Database with Advanced Analytics
    - Oracle ZFS Backup Appliance
    - Big Data Appliance
    - Next: Big Data SQL
  [Diagram labels: SQL Analysis, R-based Analysis, Machine Learning, ZFS, X3-8, X3-8, Source Systems (at Client), BDA]
Thank You & Q&A