Seamless Access from Oracle Database to Your Big Data



Seamless Access from Oracle Database to Your Big Data
Brian Macdonald
Big Data and Analytics Specialist
Oracle Enterprise Architect
September 24, 2015

Agenda
  Hadoop and SQL access methods
  What is Oracle Big Data SQL
  Big Data SQL Architecture
  Big Data SQL Configuration
  Roadmap
  Customer Story
  Q&A

First, Let's Define Big Data and Structured and Unstructured Data

SQL on Hadoop Is Obvious, Although Implementations Vary
  Hive
  Impala, HAWQ, IBM Big SQL
  Oracle SQL Connector for HDFS (OSCH)
  Oracle Big Data SQL
  A million more (Tez, Presto, Hadapt, Stinger, PolyBase, Drill, lots of startups)

SQL Analytics Challenge
  Separate silos of information to analyze
  No comprehensive SQL interface

Oracle Big Data SQL
  Hadoop + NoSQL + Relational

Oracle Big Data SQL: A New Architecture
  Powerful, high-performance SQL on Hadoop
    Full Oracle SQL capabilities on Hadoop
    SQL query processing local to Hadoop nodes
  Simple data integration of Hadoop and Oracle Database
    Single SQL point-of-entry to access all data
    Scalable joins between Hadoop and RDBMS data
  Oracle Security
    Govern all data through a single set of security policies
    Redaction, VPD, etc.
  Tool Access

Use Rich Oracle SQL Dialect Over All Data
Snapshot of Oracle SQL Analytic Functions
  Ranking functions: rank, dense_rank, cume_dist, percent_rank, ntile
  Window aggregate functions (moving and cumulative): avg, sum, min, max, count, variance, stddev, first_value, last_value
  LAG/LEAD functions: direct inter-row reference using offsets
  Reporting aggregate functions: sum, avg, min, max, variance, stddev, count, ratio_to_report
  Statistical aggregates: correlation, linear regression family, covariance
  Linear regression: fitting of an ordinary-least-squares regression line to a set of number pairs; frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
  Descriptive statistics: DBMS_STAT_FUNCS summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values
  Correlations: Pearson's correlation coefficients, Spearman's and Kendall's (both nonparametric)
  Cross tabs: enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
  Hypothesis testing: Student t-test, F-test, binomial test, Wilcoxon signed ranks test, chi-square, Mann-Whitney test, Kolmogorov-Smirnov test, one-way ANOVA
  Distribution fitting: Kolmogorov-Smirnov test, Anderson-Darling test, chi-squared test, normal, uniform, Weibull, exponential
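The ranking and LAG/LEAD families listed above can be tried on a toy table. The following is a minimal sketch, not Oracle code: it uses Python's built-in sqlite3 module as a stand-in for Oracle Database purely so the example is self-contained, and it assumes the bundled SQLite is version 3.25 or later (needed for window functions). The `clicks` table, the `rank_and_lag` helper, and the sample values are hypothetical.

```python
import sqlite3


def rank_and_lag(rows):
    """Return (custid, activity, rank, dense_rank, previous activity) per row.

    SQLite stands in for Oracle here (an assumption for illustration only).
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE clicks (custid INTEGER, activity INTEGER)")
    con.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
    query = """
        SELECT custid,
               activity,
               RANK()       OVER (ORDER BY activity DESC) AS rnk,
               DENSE_RANK() OVER (ORDER BY activity DESC) AS dense_rnk,
               LAG(activity, 1) OVER (ORDER BY custid)    AS prev_activity
        FROM clicks
        ORDER BY custid
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


# Hypothetical sample: two customers tie on the highest activity code.
sample = [(1, 9), (2, 9), (3, 7), (4, 8)]
for row in rank_and_lag(sample):
    print(row)
```

With the tie at the top, RANK leaves a gap after the tied rows while DENSE_RANK does not, and LAG returns no value for the first row because it has no predecessor.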

Oracle Big Data SQL Architecture
Two components of Oracle Big Data SQL:
  External Table extension
  Big Data SQL Server software on Big Data Appliance

A Smarter Oracle External Table
  You define:
    Table name
    Oracle types
    Any degree of parallelism
  You get:
    Automatic discovery of Hive table metadata
    Automatic translation from Hadoop types
    Automatic conversion from any InputFormat
    Fan-out parallelism across the Hadoop cluster
  [Diagram labels: Oracle Table, HDFS Data]

Unify Metadata: Publish Hive Metadata to Oracle Catalog

    CREATE TABLE movieapp_log_json (
      click VARCHAR2(4000))
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY DEFAULT_DIR
    )
    REJECT LIMIT UNLIMITED;

  [Diagram labels: Hive Metastore, Hive metadata, Oracle Catalog, External Table, Big Data Appliance + Hadoop/NoSQL, Exadata + Oracle Database]

Accessible through Oracle Data Dictionary Immediately
So the DBA doesn't need to go to Hadoop
  ALL_HIVE_DATABASES
  ALL_HIVE_TABLES
  ALL_HIVE_COLUMNS
  DBA_HIVE_DATABASES
  DBA_HIVE_TABLES
  DBA_HIVE_COLUMNS
  USER_HIVE_DATABASES
  USER_HIVE_TABLES
  USER_HIVE_COLUMNS

Extend Oracle External Tables

    CREATE TABLE movielog (
      click VARCHAR2(4000))
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY DEFAULT_DIR
      ACCESS PARAMETERS (
        com.oracle.bigdata.tablename logs
        com.oracle.bigdata.cluster mycluster
      ))
    REJECT LIMIT UNLIMITED;

  New types of external tables
    ORACLE_HIVE (inherit metadata)
    ORACLE_HDFS (specify metadata)
  Access parameters for Big Data
    Hadoop cluster
    Remote Hive database/table
  DBMS_HADOOP package for automatic import
  SQL Developer integration (Create Table)

SQL Developer Integration

How Data Is Stored in Hadoop
As files. Pretty simple.
Example: 1TB file

    {"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
    {"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"n","activity":7}
    {"custid":1083711,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
    {"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:32","recommended":"y","activity":7}
    {"custid":1010220,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:42","recommended":"y","activity":6}
    {"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
    {"custid":1253676,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
    {"custid":1351777,"movieid":608,"genreid":6,"time":"2012-07-01:00:01:03","recommended":"n","activity":7}
    {"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
    {"custid":1363545,"movieid":27205,"genreid":9,"time":"2012-07-01:00:01:18","recommended":"y","activity":7}
    {"custid":1067283,"movieid":1124,"genreid":9,"time":"2012-07-01:00:01:26","recommended":"y","activity":7}
    {"custid":1126174,"movieid":16309,"genreid":9,"time":"2012-07-01:00:01:35","recommended":"n","activity":7}
    {"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:01:39","recommended":"y","activity":7}
    {"custid":1346299,"movieid":424,"genreid":1,"time":"2012-07-01:00:05:02","recommended":"y","activity":4}

    CREATE TABLE ORDER (
      custid      VARCHAR2(10),
      recommended VARCHAR2(20),
      activity    NUMBER(8,2))
    ORGANIZATION EXTERNAL (TYPE oracle_hdfs)
    LOCATION ('hdfs:/usr/cust/summary/*');

  Assumes default values
  Table options
    Fields
    Column maps
    Delimiters
    File formats: json, textfile, sequencefile, ...
    SerDes, e.g. regex
    More (see docs)
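To make the "files plus a schema applied at read time" idea concrete, here is a minimal sketch in Python. It is not how Big Data SQL reads HDFS; it only shows that each line of the file is an independent JSON record and that a table definition amounts to choosing which fields to project. The three records are copied from the slide; the `project` helper is hypothetical.

```python
import json

# Three records copied from the slide's sample file (one JSON object per line).
SAMPLE = """\
{"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"n","activity":7}
{"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:32","recommended":"y","activity":7}
"""


def project(lines, columns):
    """Parse each JSON line and keep only the named columns (schema-for-read)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        # JSON null becomes None, much as a missing value would surface as NULL.
        rows.append(tuple(record.get(col) for col in columns))
    return rows


# Same three columns as the CREATE TABLE on the slide.
rows = project(SAMPLE.splitlines(), ("custid", "recommended", "activity"))
for row in rows:
    print(row)
```

The file itself carries no schema; the column list supplied by the reader decides what a "row" looks like.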

Creating an External Table against Hive

    DBMS_HADOOP.CREATE_EXTDDL_FOR_HIVE (
      cluster_id      IN  VARCHAR2,
      db_name         IN  VARCHAR2 := NULL,
      hive_table_name IN  VARCHAR2,
      hive_partition  IN  BOOLEAN,
      table_name      IN  VARCHAR2 := NULL,
      perform_ddl     IN  BOOLEAN DEFAULT FALSE,
      text_of_ddl     OUT VARCHAR2
    );

    set serveroutput on
    DECLARE
      DDLout VARCHAR2(4000);
    BEGIN
      dbms_hadoop.create_extddl_for_hive(
        CLUSTER_ID      => 'bigdatalite',
        DB_NAME         => 'brian',
        HIVE_TABLE_NAME => 'movie',
        HIVE_PARTITION  => FALSE,
        TABLE_NAME      => 'movie',
        PERFORM_DDL     => FALSE,
        TEXT_OF_DDL     => DDLout);
      dbms_output.put_line(DDLout);
    END;
    /

Oracle External Tables: Flexibility for Varied File Structures

    CREATE TABLE ORDER (
      cust_num    VARCHAR2(10),
      order_num   VARCHAR2(20),
      order_total NUMBER(8,2))
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY DEFAULT_DIR
    )
    PARALLEL 20
    REJECT LIMIT UNLIMITED;

  Transparent schema-for-read
    Use fast C-based readers when possible
    Use native Hadoop classes otherwise
  Engineered to understand parallelism
    Map external units of parallelism to Oracle
  Architected for extensibility
    StorageHandler capability enables support for other data sources
    Examples: MongoDB, HBase, Oracle NoSQL DB

StorageHandlers: Extensibility Beyond HDFS
  StorageHandlers are a metadata bridge.
  [Diagram labels: Oracle Big Data SQL, Hive Metastore]

What Gives Exadata Extreme Performance?
  Offload query to Exadata Storage Servers
  Small data subset quickly returned
  [Diagram labels: SQL, Oracle Database 12c]

Introducing Oracle Big Data SQL
Massively parallel SQL query across Oracle, Hadoop and NoSQL
  [Diagram labels: Hadoop & NoSQL, Oracle Database 12c]

Big Data Appliance X5-2
  Sun Oracle X5-2L servers with, per server:
    2 * 18-core Intel Xeon E5 processors
    128 GB memory
    96 TB disk space
  Integrated software (4.2):
    Oracle Linux 6.6
    Oracle Big Data SQL 1.1*
    Cloudera Distribution of Apache Hadoop 5.4, EDH Edition
    Cloudera Manager 5.4
    Oracle R Distribution
    Oracle NoSQL Database CE
  * Oracle Big Data SQL is separately licensed

Introducing Oracle Big Data SQL
Massively parallel SQL query across Oracle, Hadoop and NoSQL
  Hadoop & NoSQL: SQL query offloaded to data nodes; small data subset quickly returned
  Oracle Database 12c: SQL query offloaded to Exadata Storage Servers; small data subset quickly returned

Big Data SQL Server: A New Hadoop Processing Engine
  Processing layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
  Resource management (YARN, cgroups)
  Storage layer: filesystem (HDFS), NoSQL databases (Oracle NoSQL DB, HBase)

Big Data SQL Query Execution
How do we query Hadoop?
  1. Query compilation determines:
       Data locations
       Data structure
       Parallelism
  2. Fast reads using Big Data SQL Server:
       Schema-for-read using Hadoop classes
       Smart Scan selects only relevant data
  3. Process filtered result:
       Move relevant data to database
       Join with database tables
       Apply database security policies
  [Diagram labels: HDFS NameNode, Hive Metastore, HDFS Data Node with BDSQL (shown twice)]
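Steps 2 and 3 above can be pictured with a toy model. This is a conceptual sketch only, not the Big Data SQL implementation: each "data node" is a Python list, `smart_scan` filters and projects rows where they live, and the "database" joins the small result with its own table. All function names, the `customers` lookup, and the sample rows are hypothetical.

```python
def smart_scan(block, predicate, columns):
    """Runs 'on the data node': keep matching rows and only the requested columns."""
    return [{c: row[c] for c in columns} for row in block if predicate(row)]


def run_query(blocks, predicate, columns, customers):
    """Toy version of steps 2 and 3 of the slide."""
    # Step 2: every node scans its own block and returns only relevant data.
    filtered = []
    for block in blocks:
        filtered.extend(smart_scan(block, predicate, columns))
    # Step 3: the database joins the small filtered result with its own table.
    return [
        (customers[row["custid"]], row["movieid"])
        for row in filtered
        if row["custid"] in customers
    ]


# Hypothetical data: two blocks on two data nodes, plus a database-side table.
blocks = [
    [{"custid": 1, "movieid": 10, "activity": 7},
     {"custid": 2, "movieid": None, "activity": 8}],
    [{"custid": 3, "movieid": 11, "activity": 7},
     {"custid": 1, "movieid": 12, "activity": 9}],
]
customers = {1: "Ann", 3: "Raj"}

result = run_query(
    blocks,
    predicate=lambda row: row["activity"] == 7,
    columns=("custid", "movieid"),
    customers=customers,
)
print(result)
```

Only two of the four rows, and two of the three columns, ever leave the simulated data nodes, which is the point of filtering before moving data to the database.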

Apply Advanced Security on Hadoop & NoSQL
Same security policies across all data
  Redaction
  Virtual Private Database
  Fine-grain Access Control

    DBMS_REDACT.ADD_POLICY(
      object_schema => 'sales',
      object_name   => 'customer_detail',
      column_name   => 'last_name',
      policy_name   => 'customer_privacy',
      function_type => DBMS_REDACT.FULL,
      expression    => '1=1'
    );

  [Diagram labels: JSON, Raw JSON data in Hadoop, SQL, Customer data in Oracle, Hadoop, Redacted data subset, Oracle Database 12c]

Configuration
  Install Oracle Big Data SQL on the BDA using Mammoth
  Run the Big Data SQL-Exadata installation script on each Oracle Exadata database node
    Sets up connectivity from Exadata to the Big Data SQL Servers on the BDA
    Installs a Hadoop client
    Configures directories and files
      Big Data SQL Agent
      Oracle directory objects
      Others

Directories
Two types of directories are created:
  Common Directory
    Must be on a cluster-wide shared file system
    Subdirectories for jar files
    bigdata.properties (paths, etc.)
  Cluster Directory(s)
    Configuration details for each BDA cluster
    Subdirectory of the Common Directory
  Oracle directory objects that point to these directories
    ORACLE_BIGDATA_CONFIG: Common Directory
    ORACLE_BIGDATA_CL_XXXX: one for each Cluster Directory (case sensitive)

Big Data SQL Agents
Created by the install script
  This multi-threaded agent bridges the metadata between Oracle Database and Hadoop.
  It launches a single JVM instead of one for every process (which can be quite slow).

    create public database link BDSQL$_XXXX using 'extproc_connection_data';
    create public database link BDSQL$_DEFAULT_CLUSTER using 'extproc_connection_data';

  (XXXX is the name of each BDA cluster from the Cluster Directories.)

If Kerberos Is Used on BDA
  Must create a ticket (kinit) for the BDS user
    BDS runs as the Oracle user
  Need to renew tickets
    cron
    Other automation to be released soon

Requirements - For Now
  Exadata
    Oracle 12.1.0.2.1+
    Storage Servers 2.1.1.1 or 12.1.1.0
  Exadata configured on the same InfiniBand subnet as BDA
  Exadata and BDA connected by InfiniBand

Roadmap Subsequent content subject to change!

Enhanced Parallelism
  Today
    Hadoop DoP linked to RDBMS DoP
    Leads to many idle PQ processes
    Requires explicit declaration
  Next
    Unlink Hadoop and RDBMS DoP
    Automatic max Hadoop parallelism, even on serial tables
    An average of 40% faster, even at equivalent DoP

Storage Indexing
  Today
    All blocks in a query must be read from disk
    Large (256MB) disk I/O for each block
  Next
    Automatically create Storage Indexes in Big Data SQL Agents
    Check index before reading blocks
    Skip unnecessary I/Os
    An average of 65% faster
    Up to 100x faster for highly selective queries
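The "check index before reading blocks" step can be illustrated with a small sketch. The slide does not say how the index is built, so this assumes a per-block minimum and maximum of one column, which is a common way to skip blocks; that assumption, the function names, and the sample data are all hypothetical, and blocks are assumed to be non-empty.

```python
def build_index(blocks, column):
    """Record the (min, max) of `column` for every block. Blocks must be non-empty."""
    index = []
    for block in blocks:
        values = [row[column] for row in block]
        index.append((min(values), max(values)))
    return index


def scan_equal(blocks, index, column, value):
    """Read only the blocks whose [min, max] range could contain `value`."""
    matches = []
    blocks_read = 0
    for block, (low, high) in zip(blocks, index):
        if value < low or value > high:
            continue  # the index rules this block out, so its I/O is skipped
        blocks_read += 1
        matches.extend(row for row in block if row[column] == value)
    return matches, blocks_read


# Hypothetical data: three blocks with non-overlapping custid ranges.
blocks = [
    [{"custid": 1}, {"custid": 5}],
    [{"custid": 10}, {"custid": 20}],
    [{"custid": 21}, {"custid": 30}],
]
index = build_index(blocks, "custid")
matches, blocks_read = scan_equal(blocks, index, "custid", 20)
print(matches, blocks_read)
```

A highly selective predicate touches one block out of three here; the benefit shrinks when value ranges overlap across blocks, because fewer blocks can be ruled out.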

Customer Examples

Building Customer Loyalty
  Company overview
    Customer loyalty marketing and programs for major retailers and consumer brands
  Challenges
    Deliver personalized multi-channel content to every customer (example: Kroger's MyMagazine)
    Expand to a wide variety of interaction data to build customer profiles
  Benefits
    2x improvements in campaign performance
    Large-scale concurrent processing of complex SQL
    70% of analysis is done in SQL, uses R as well
  Solution overview
    Oracle Exadata X3-8
    Oracle Database with Advanced Analytics
    Oracle ZFS Backup Appliance
    Big Data Appliance
    Next: Big Data SQL
  [Diagram labels: SQL Analysis, R-based Analysis, Machine Learning, ZFS, X3-8, X3-8, Source Systems (at Client), BDA]

Thank You & Q&A