Deep Quick-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors Mark Rittman, CTO, Rittman Mead Oracle Openworld 2014, San Francisco

About the Speaker Mark Rittman, Co-Founder of Rittman Mead Oracle ACE Director, specialising in Oracle BI&DW 14 Years' Experience with Oracle Technology Regular columnist for Oracle Magazine Author of two Oracle Press Oracle BI books: Oracle Business Intelligence Developer's Guide and Oracle Exalytics Revealed Writer for the Rittman Mead Blog : http://www.rittmanmead.com/blog Email : mark.rittman@rittmanmead.com Twitter : @markrittman

Traditional Data Warehouse / BI Architectures Three-layer architecture - staging, foundation and access/performance All three layers stored in a relational database (Oracle) ETL used to move data from layer-to-layer (Diagram: traditional structured data sources loaded via ETL into the staging, foundation / ODS and performance / dimensional layers of the relational data warehouse, with direct read by a BI tool (OBIEE) with metadata layer, or data loaded into an OLAP / in-memory tool's own database)

Introducing Hadoop A new approach to data processing and data storage Rather than a small number of large, powerful servers, it spreads processing over large numbers of small, cheap, redundant servers Spreads the data you're processing over lots of distributed nodes Has a scheduling/workload process (the Job Tracker) that sends parts of a job to each of the nodes - a bit like Oracle Parallel Execution And does the processing where the data sits - a bit like Exadata storage servers Shared-nothing architecture Low-cost and highly horizontally scalable (Diagram: Task Trackers co-located with Data Nodes across the cluster)

Hadoop Tenets : Simplified Distributed Processing Hadoop, through MapReduce, breaks processing down into simple stages Map : select the columns and values you're interested in, pass through as key/value pairs Reduce : aggregate the results Most ETL jobs can be broken down into filtering, projecting and aggregating Hadoop then automatically runs the job on the cluster Shared-nothing, small chunks of work Runs the job on the node where the data is Handles faults etc Gathers the results back in (Diagram: mappers filter and project in parallel, reducers aggregate, with one HDFS output file per reducer, in a directory)

Moving Data In, Around and Out of Hadoop Three stages to Hadoop ETL work, with dedicated Apache / other tools Load : receive files in batch, or in real-time (logs, events) Transform : process & transform data to answer questions Store / Export : store in structured form, or export to RDBMS using Sqoop (Diagram: RDBMS, file / unstructured and real-time log / event imports feed the loading stage; the processing stage transforms the data; the store / export stage writes file exports or RDBMS exports)

ETL Offloading Special use-case : offloading low-value, simple ETL work to a Hadoop cluster Receiving, aggregating, filtering and pre-processing data for an RDBMS data warehouse Potentially free up high-value Exadata / RDBMS servers for analytic work

Core Apache Hadoop Tools Apache Hadoop, including MapReduce and HDFS HDFS : scalable, fault-tolerant file storage for Hadoop MapReduce : parallel programming framework for Hadoop Apache Hive SQL abstraction layer over HDFS Perform set-based ETL within Hadoop Apache Pig, Spark Dataflow-type languages over HDFS, Hive etc Extensible through UDFs, streaming etc Apache Flume, Apache Sqoop, Apache Kafka Real-time and batch loading into HDFS Modular, fault-tolerant, wide source/target coverage

Hive as the Hadoop SQL Access Layer MapReduce jobs are typically written in Java, but Hive can make this simpler Hive is a query environment over Hadoop/MapReduce to support SQL-like queries Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, automatically creates MapReduce jobs against data previously loaded into the Hive HDFS tables Approach used by ODI and OBIEE to gain access to Hadoop data Allows Hadoop data to be accessed just like any other data source (sort of...)
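As an illustration (reusing the access_per_post_categories Hive table that appears later in this example; the exact columns are an assumption), a HiveQL query like the following is compiled by Hive into one or more MapReduce jobs behind the scenes, so client tools only ever see a SQL-style interface:

SELECT category, COUNT(*) AS page_views
FROM access_per_post_categories
GROUP BY category
ORDER BY page_views DESC;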

Oracle's Big Data Products Oracle Big Data Appliance - Engineered System for Big Data Acquisition and Processing Cloudera Distribution of Hadoop Cloudera Manager Open-source R Oracle NoSQL Database Oracle Enterprise Linux + Oracle JVM New - Oracle Big Data SQL Oracle Big Data Connectors Oracle Loader for Hadoop (Hadoop > Oracle RDBMS) Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS) Oracle R Advanced Analytics for Hadoop Oracle Data Integrator 12c

Oracle Loader for Hadoop Oracle technology for accessing Hadoop data, and loading it into an Oracle database Pushes data transformation, heavy lifting to the Hadoop cluster, using MapReduce Direct-path loads into Oracle Database, partitioned and non-partitioned Online and offline loads Key technology for fast load of Hadoop results into Oracle DB
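A minimal sketch of how an OLH job is typically kicked off from the command line, assuming the documented oraloader.jar entry point; the configuration file name and path are placeholders, and in this example the equivalent invocation is generated by the ODI knowledge module rather than run by hand:

# Hypothetical invocation - the XML config names the input data, target table and DB connection
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf /home/oracle/olh_weblog_summary_conf.xml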

Oracle Direct Connector for HDFS Enables HDFS as a data-source for Oracle Database external tables Effectively provides Oracle SQL access over HDFS Supports data query, or import into Oracle DB Treat HDFS-stored files in the same way as regular files But with HDFS's low cost and fault-tolerance
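As a rough sketch of the setup step (based on the connector's later Oracle SQL Connector for HDFS packaging; jar location and config file name are assumptions), a command-line tool generates the external table definition, including the hdfs_stream preprocessor, from an XML configuration:

# Hypothetical example - creates the Oracle external table over files in HDFS
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
  -conf /home/oracle/osch_weblog_summary_conf.xml \
  -createTable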

Oracle R Advanced Analytics for Hadoop Add-in to R that extends capability to Hadoop Gives R the ability to create Map and Reduce functions Extends R data frames to include Hive tables Automatically run R functions on Hadoop by using Hive tables as source

Just Released - Oracle Big Data SQL Part of Oracle Big Data 4.0 (BDA-only) Also requires Oracle Database 12c, Oracle Exadata Database Machine Extends Oracle Data Dictionary to cover Hive Extends Oracle SQL and SmartScan to Hadoop Extends Oracle Security Model over Hadoop Fine-grained access control Data redaction, data masking (Diagram: SQL queries run on the Exadata database server, with SmartScan offload to both the Exadata storage servers and the Hadoop cluster via Oracle Big Data SQL)

Bringing it All Together : Oracle Data Integrator 12c ODI provides an excellent framework for running Hadoop ETL jobs ELT approach pushes transformations down to Hadoop - leveraging power of cluster Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation Whilst still preserving RDBMS push-down Extensible to cover Pig, Spark etc Process orchestration Data quality / error handling Metadata and model-driven

ODI and Big Data Integration Example In this example, we'll show an end-to-end ETL process on Hadoop using ODI12c & BDA Scenario: load webserver log data into Hadoop, process, enhance and aggregate, then load final summary table into Oracle Database 12c Process using Hadoop framework Leverage Big Data Connectors Metadata-based ETL development using ODI12c Real-world example

ETL & Data Flow through BDA System Five-step process to load, transform, aggregate and filter incoming log data Leverage ODI's capabilities where possible Make use of Hadoop power + scalability! (Diagram: (1) a Flume agent on the Apache HTTP Server forwards log entries over Flume messaging (TCP port 4545 in the example) to a Flume agent writing log files to HDFS, which IKM File to Hive loads into the hive_raw_apache_access_log Hive table using the RegEx SerDe; (2) IKM Hive Control Append joins that table to the posts Hive table, loading log_entries_and_post_detail; (3) a Sqoop extract stages Oracle reference data into the categories_sql_extract Hive table, joined in with another IKM Hive Control Append step; (4) IKM Hive Transform streams the data through a Python script against a geocoding IP>Country list Hive table; (5) IKM File / Hive to Oracle bulk-unloads the summary data to the Oracle DB)

ETL Considerations : Using Hive vs. Regular Oracle SQL Not all join types are available in Hive - joins must be equality joins No sequences, no primary keys on tables Generally need to stage Oracle or other external data into Hive before joining to it Hive latency - not good for small microbatch-type work But other alternatives exist - Spark, Impala etc Hive is INSERT / APPEND only - no updates, deletes etc But HBase may be suitable for CRUD-type loading Don't assume that HiveQL == Oracle SQL Test assumptions before committing to platform

Five-Step ETL Process 1. Take the incoming log files (via Flume) and load into a structured Hive table 2. Enhance data from that table to include details on authors, posts from other Hive tables 3. Join to some additional ref. data held in an Oracle database, to add author details 4. Geocode the log data, so that we have the country for each calling IP address 5. Output the data in summary form to an Oracle database

Using Flume to Transport Log Files to BDA Apache Flume is the standard way to transport log files from source through to target Initial use-case was webserver log files, but can transport any file from A>B Does not do data transformation, but can send to multiple targets / target types Mechanisms and checks to ensure successful transport of entries Has a concept of agents, sinks and channels Agents collect and forward log data Sinks store it in final destination Channels store log data en-route Simple configuration through INI files Handled outside of ODI12c
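A minimal sketch of that INI-style configuration, assuming a simple exec source tailing the Apache access log (agent, source, channel and sink names are hypothetical; the example in this deck actually chains two agents over TCP port 4545):

# Hypothetical single-agent Flume configuration - names and paths are illustrative
weblog_agent.sources  = apache_access
weblog_agent.channels = mem_channel
weblog_agent.sinks    = hdfs_sink

weblog_agent.sources.apache_access.type     = exec
weblog_agent.sources.apache_access.command  = tail -F /var/log/httpd/access_log
weblog_agent.sources.apache_access.channels = mem_channel

weblog_agent.channels.mem_channel.type     = memory
weblog_agent.channels.mem_channel.capacity = 10000

weblog_agent.sinks.hdfs_sink.type          = hdfs
weblog_agent.sinks.hdfs_sink.channel       = mem_channel
weblog_agent.sinks.hdfs_sink.hdfs.path     = /user/flume/incoming/apache_logs
weblog_agent.sinks.hdfs_sink.hdfs.fileType = DataStream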

GoldenGate for Continuous Streaming to Hadoop Oracle GoldenGate is also an option, for streaming RDBMS transactions to Hadoop Leverages GoldenGate & HDFS / Hive Java APIs Sample Implementations on MOS Doc.ID 1586210.1 (HDFS) and 1586188.1 (Hive) Likely to be formal part of GoldenGate in future release - but usable now Can also integrate with Flume for delivery to HDFS - see MOS Doc.ID 1926867.1

1 Load Incoming Log Files into Hive Table First step in process is to load the incoming log files into a Hive table Also need to parse the log entries to extract request, date, IP address etc columns Hive table can then easily be used in downstream transformations Use IKM File to Hive (LOAD DATA) KM Source can be local files or HDFS Either load file into Hive HDFS area, or leave as external Hive table Ability to use SerDe to parse file data

Using IKM File to Hive to Load Web Log File Data into Hive Create mapping to load file source (single column for weblog entries) into Hive table Target Hive table should have column for incoming log row, and parsed columns

Specifying a SerDe to Parse Incoming Hive Data SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats Distributed as JAR file, gives Hive ability to parse semi-structured formats We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns Enabled through OVERRIDE_ROW_FORMAT IKM File to Hive (LOAD DATA) KM option
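For reference, the Hive DDL that the RegEx SerDe approach amounts to looks roughly like this - a sketch using the commonly published regex for the Apache CombinedLogFormat (column names and HDFS location are illustrative, and within ODI the ROW FORMAT clause is what gets supplied through the OVERRIDE_ROW_FORMAT option rather than typed by hand):

CREATE EXTERNAL TABLE hive_raw_apache_access_log (
  host        STRING,
  identity    STRING,
  apache_user STRING,
  event_time  STRING,
  request     STRING,
  status      STRING,
  size        STRING,
  referer     STRING,
  agent       STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE
LOCATION '/user/flume/incoming/apache_logs';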

2 Join to Additional Hive Tables, Transform using HiveQL IKM Hive to Hive Control Append can be used to perform Hive table joins, filtering, agg. etc. INSERT only, no DELETE, UPDATE etc Not all ODI12c mapping operators supported, but basic functionality works OK Use this KM to join to other Hive tables, adding more details on post, title etc Perform DISTINCT on join output, load into summary Hive table

Joining Hive Tables Only equi-joins supported Must use ANSI syntax More complex joins may not produce valid HiveQL (subqueries etc)
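For example, the join to the posts table in step 2 can be written as an ANSI-syntax equi-join along these lines (join keys and columns are illustrative, not taken from the actual repository):

SELECT l.host, l.request_date, p.post_id, p.title, p.author
FROM hive_raw_apache_access_log l
JOIN posts p
  ON l.post_id = p.post_id;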

Filtering, Aggregating and Transforming Within Hive Aggregate (GROUP BY), DISTINCT, FILTER, EXPRESSION, JOIN, SORT etc mapping operators can be added to mapping to manipulate data Generates HiveQL functions, clauses etc

3 Bring in Reference Data from Oracle Database In this third step, additional reference data from Oracle Database needs to be added In theory, should be able to add Oracle-sourced datastores to mapping and join as usual But Oracle / JDBC-generic LKMs don't work with Hive

Options for Importing Oracle / RDBMS Data into Hadoop Could export RDBMS data to file, and load using IKM File to Hive Oracle Big Data Connectors only export to Oracle, not import to Hadoop Best option is to use Apache Sqoop, and new IKM SQL to Hive-HBase-File knowledge module Hadoop-native, automatically runs in parallel Uses native JDBC drivers, or OraOop (for example) Bi-directional in-and-out of Hadoop to RDBMS Run from OS command-line
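For comparison, the equivalent hand-run Sqoop import into a Hive staging table would look something like this (connection URL, credentials and table names are placeholders; in the mapping this command line is generated by the KM):

sqoop import \
  --connect jdbc:oracle:thin:@//dbserver:1521/pdborcl \
  --username BLOG_REFDATA --password welcome1 \
  --table POST_CATEGORIES \
  --hive-import --hive-table categories_sql_extract \
  --num-mappers 1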

Loading RDBMS Data into Hive using Sqoop First step is to stage Oracle data into equivalent Hive table Use special LKM SQL Multi-Connect Global load knowledge module for Oracle source Passes responsibility for load (extract) to following IKM Then use IKM SQL to Hive-HBase-File (Sqoop) to load the Hive table

Join Oracle-Sourced Hive Table to Existing Hive Table Oracle-sourced reference data in Hive can then be joined to existing Hive table as normal Filters, aggregation operators etc can be added to mapping if required Use IKM Hive Control Append as integration KM

New Option - Using Oracle Big Data SQL Oracle Big Data SQL provides ability for Exadata to reference Hive tables Use feature to create join in Oracle, bringing across Hive data through ORACLE_HIVE table
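A hedged sketch of that approach, assuming the ORACLE_HIVE external table driver and its com.oracle.bigdata.tablename access parameter (table, column and directory names are illustrative):

CREATE TABLE access_per_post_ext (
  hostname  VARCHAR2(200),
  post_id   VARCHAR2(20),
  category  VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.access_per_post)
)
REJECT LIMIT UNLIMITED;

-- The Hive data can then be joined to Oracle reference data with ordinary SQL
SELECT c.category_name, COUNT(*)
FROM access_per_post_ext a
JOIN blog_categories c ON a.category = c.category_code
GROUP BY c.category_name;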

4 Using Hive Streaming and Python for Geocoding Data Another requirement we have is to geocode the webserver log entries Allows us to aggregate page views by country Based on the fact that IP ranges can usually be attributed to specific countries Not functionality normally found in Hive etc, but can be done with add-on APIs

How GeoIP Geocoding Works Uses free Geocoding API and database from MaxMind Convert IP address to an integer Find which integer range our IP address sits within But Hive can't use BETWEEN in a join
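To make the integer conversion concrete, a small illustrative calculation (not part of the actual mapping) treats the dotted quad as a base-256 number:

# Illustrative only: a.b.c.d -> a*256^3 + b*256^2 + c*256 + d
def ip_to_int(ip):
    a, b, c, d = (int(octet) for octet in ip.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

print ip_to_int('8.8.8.8')   # 134744072, which falls in a range MaxMind attributes to the US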

Solution : IKM Hive Transform IKM Hive Transform can pass the output of a Hive SELECT statement through a perl, python, shell etc script to transform content Uses Hive's TRANSFORM ... USING ... AS functionality

hive> add file file:///tmp/add_countries.py;
Added resource: file:///tmp/add_countries.py
hive> select transform (hostname,request_date,post_id,title,author,category)
    > using 'add_countries.py'
    > as (hostname,request_date,post_id,title,author,category,country)
    > from access_per_post_categories;

Creating the Python Script for Hive Streaming Solution requires a Python GeoIP API to be installed on all Hadoop nodes, along with the geocode DB:

wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
pip install pygeoip

The Python script then parses incoming stdin lines using tab-separation of fields, and outputs the same (but with an extra field for the country):

#!/usr/bin/python
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip

# Load the MaxMind GeoIP database copied to each node's /tmp directory
gi = pygeoip.GeoIP('/tmp/GeoIP.dat')

# Read tab-separated rows from Hive on stdin, append the country, write back to stdout
for line in sys.stdin:
    line = line.rstrip()
    hostname, request_date, post_id, title, author, category = line.split('\t')
    country = gi.country_name_by_addr(hostname)
    print hostname + '\t' + request_date + '\t' + post_id + '\t' + title + '\t' + author + '\t' + category + '\t' + country

Setting up the Mapping Map the source Hive table to the target, which includes a column for the extra country value Copy the script + GeoIP.dat file to every node's /tmp directory Ensure all Python APIs and libraries are installed on each Hadoop node

Configuring IKM Hive Transform TRANSFORM_SCRIPT_NAME specifies name of script, and path to script TRANSFORM_SCRIPT has issues with parsing; do not use, leave blank and KM will use existing one Optional ability to specify sort and distribution columns (can be compound) Leave other options at default

Executing the Mapping KM automatically registers the script with Hive (which caches it on all nodes) HiveQL output then runs the contents of the first Hive table through the script, outputting results to target table

5 Bulk Unload Summary Data to Oracle Database Final requirement is to unload final Hive table contents to Oracle Database Several use-cases for this: Use Hadoop / BDA for ETL offloading Use analysis capabilities of BDA, but then output results to RDBMS data mart or DW Permit use of more advanced SQL query tools Share results with other applications Can use Sqoop for this, or use Oracle Big Data Connectors Fast bulk unload, or transparent Oracle access to Hive
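For the Sqoop route, a hand-run export of the summary Hive table's warehouse directory would look roughly like this (connection details, table and directory names are placeholders; \001 is Hive's default field delimiter) - the Big Data Connector alternative is handled by the KM described on the next slide:

sqoop export \
  --connect jdbc:oracle:thin:@//dbserver:1521/pdborcl \
  --username BLOG_DW --password welcome1 \
  --table ACCESS_PER_POST_SUMMARY \
  --export-dir /user/hive/warehouse/access_per_post_summary \
  --input-fields-terminated-by '\001'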

IKM File/Hive to Oracle (OLH/ODCH) KM for accessing HDFS/Hive data from Oracle Either sets up ODCH connectivity, or bulk-unloads via OLH Map from HDFS or Hive source to Oracle tables (via Oracle technology in Topology)

Configuring the KM Physical Settings For the access table in Physical view, change LKM to LKM SQL Multi-Connect Delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect IKM such as IKM File/Hive to Oracle

Create Package to Sequence ETL Steps Define package (or load plan) within ODI12c to orchestrate the process Call package / load plan execution from command-line, web service call, or schedule
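For the command-line route, the standalone agent's startscen script can run the generated scenario - a sketch only, since exact arguments vary by ODI version and agent configuration (scenario name, version and context here are placeholders):

# Hypothetical example: scenario name, scenario version, execution context
./startscen.sh LOAD_WEBSERVER_LOGS 001 GLOBAL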

Execute Overall Package Each step executed in sequence End-to-end ETL process, using ODI12c's metadata-driven development process, data quality handling, heterogeneous connectivity, but Hadoop-native processing

Conclusions Hadoop, and the Oracle Big Data Appliance, is an excellent platform for data capture, analysis and processing Hadoop tools such as Hive, Sqoop, MapReduce and Pig provide means to process and analyse data in parallel, using languages + approach familiar to Oracle developers ODI12c provides several benefits when working with ETL and data loading on Hadoop Metadata-driven design; data quality handling; KMs to handle technical complexity Oracle Data Integrator Adapter for Hadoop provides several KMs for Hadoop sources In this presentation, we've seen an end-to-end example of big data ETL using ODI The power of Hadoop and BDA, with the ETL orchestration of ODI12c

Thank You for Attending! Thank you for attending this presentation, and more information can be found at http://www.rittmanmead.com Contact us at info@rittmanmead.com or mark.rittman@rittmanmead.com Look out for our book, Oracle Business Intelligence Developer's Guide, out now! Follow us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)

Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors Mark Rittman, CTO, Rittman Mead Oracle Openworld 2014, San Francisco