Constructing a Data Lake: Hadoop and Oracle Database United!




Sharon Sophia Stephen, Big Data PreSales Consultant
February 21, 2015

Safe Harbor: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Program Agenda
1. Hadoop Overview
2. Data Lake
3. How the technologies can work together
4. Tools for Integration
5. Q&A
Oracle Confidential Internal/Restricted/Highly Restricted

What Is Hadoop? The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

What is HDFS? The Hadoop Distributed File System
- HDFS is the primary storage system underlying Hadoop
- Fault tolerant, scalable, highly available
- Designed to be well-suited to distributed processing
- Splits large files into blocks
- Multiple copies stored on different disks on separate nodes
- Is superficially structured like a UNIX file system
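The block splitting and replica placement described above can be sketched in a few lines of Python. This is only an illustration, not how HDFS is implemented: the 128 MB block size and replication factor of 3 are assumed values, the node names are made up, and the round-robin placement is a stand-in for HDFS's real placement policy.

```python
import math

# Assumed values for illustration only.
BLOCK_SIZE = 128 * 1024 * 1024   # bytes per block
REPLICATION = 3                  # copies kept of each block


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file into blocks and assign each block's replicas to distinct nodes.

    Returns a list of (block_index, [node, ...]) tuples.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placements = []
    for block in range(num_blocks):
        # Simplified round-robin: consecutive nodes, so no two replicas
        # of the same block share a node.
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        placements.append((block, replicas))
    return placements


# Hypothetical four-node cluster storing a 300 MB file.
nodes = ["node1", "node2", "node3", "node4"]
layout = place_blocks(300 * 1024 * 1024, nodes)
for block, replicas in layout:
    print(block, replicas)
```

Because every block lives on several nodes, losing one disk or one node leaves at least one other copy of each block readable.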

A MapReduce (True Distributed Computing) Analogy: Going From Estimates To Actuals

A MapReduce (True Distributed Computing) Example: Putting The Analogy Into Practice

MapReduce Phases
Map
- Each Map task usually works on a single input split
- Hadoop tries to run map tasks on the slave node that contains the stored HDFS data block (data locality)
- The input is presented to the Map phase as a key-value pair
Shuffle and Sort
- Groups all the values together for each key (using the intermediate output data from all of the completed mappers)
- Sorts the keys
Reduce
- The intermediate output of the Shuffle and Sort phase is the input to the Reduce phase
- The Reduce function (written by the developer) generates the final output
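The three phases can be imitated on a single machine with a word count. The sketch below uses plain Python rather than the Hadoop API; the function names and the sample input splits are invented for illustration.

```python
from collections import defaultdict


def map_phase(key, value):
    """Map: receive one (key, value) pair, here (line number, line text),
    and emit a (word, 1) pair for every word in the line."""
    for word in value.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and Sort: group all values by key, then sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: collapse the grouped values for one key into the final output."""
    return key, sum(values)


# Two hypothetical input splits, each a list of (line number, line) pairs.
splits = [
    [(0, "big data big")],
    [(0, "data lake")],
]

# Each split would be handled by its own map task on a real cluster.
intermediate = [
    pair
    for split in splits
    for key, value in split
    for pair in map_phase(key, value)
]
result = [reduce_phase(key, values) for key, values in shuffle_and_sort(intermediate)]
print(result)  # [('big', 2), ('data', 2), ('lake', 1)]
```

On a cluster the map calls run in parallel near the data, and the shuffle moves intermediate pairs across the network so that all values for a key reach the same reducer.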

What Else Is Hadoop?
Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache Zookeeper, Apache Flume, Apache Sqoop, Apache Mahout, Apache Whirr, Apache Oozie, Fuse-DFS, Hue
Plus additional projects: Impala, BDR, Sentry

Two Great Tastes That Go Great Together
[Diagram labels: Things you can do (big oval); Things you can do cost-effectively (small oval); Hadoop; Database]
A. Hadoop can do some things that a database cannot reasonably do.
B. Hadoop expands the number of things you can do cost-effectively.


Data Lake

Data Lake A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.

Focus in this Session
Hadoop:
- Data in files
- Schema on read
- Simple programming model for large scale data processing
- Append only
- Sequential access of blocks
Relational:
- Data organized for fast query
- Structured schema
- Complex programming models
- Read, write, delete, update
- Access specific record
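"Schema on read" means the raw file is stored untouched and a schema is applied only when the data is read. A minimal Python sketch of the idea follows; the sample records, column names, and types are all made up for illustration.

```python
import csv
import io
from datetime import date

# Raw data is kept exactly as it arrived; nothing is validated at write time.
RAW = "1001,CA,2015-02-21\n1002,NY,not-a-date\n"

# Hypothetical schema chosen by the reader: (column name, conversion function).
SCHEMA = [
    ("customer_id", int),
    ("state", str),
    ("signup", date.fromisoformat),
]


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; return (typed rows, rejected records)."""
    rows, rejects = [], []
    for fields in csv.reader(io.StringIO(raw_text)):
        try:
            rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
        except ValueError:
            # A record that does not fit this schema surfaces at read time.
            rejects.append(fields)
    return rows, rejects


rows, rejects = read_with_schema(RAW, SCHEMA)
print(rows)
print(rejects)
```

A relational table would have rejected the second record when it was inserted. Here it is stored regardless, and a different reader could apply a different schema to the same file.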

Oracle Big Data Connectors
- Data load: Oracle Loader for Hadoop
- Data access: Oracle SQL Connector for HDFS
- Oracle Data Integrator Knowledge Modules
- R analytics (R client): Oracle R Advanced Analytics on Hadoop
- XML/XQuery: Oracle XQuery on Hadoop
- Oracle Big Data SQL (for engineered systems)
Optimized for Hadoop: maximize parallelism, fast performance
Analyze data on Hadoop using familiar client tools

Integrating Data: Hadoop and Oracle Databases
Transferring from Hadoop to Oracle RDBMS:
- OSCH for CSV text
- OSCH Hive non-partition
- OLH for CSV text
- OLH Hive non-partition
- OLH Data Pump (offline)
- OLH Hive partition
- OLH Hive partition Parquet
- Sqoop for CSV text
- Sqoop Hive partition
- Sqoop Hive partition Parquet
- Sqoop 2: Hive partition Parquet
Transferring from Oracle RDBMS to Hadoop:
- CopytoBDA (only on Engineered System platform)
- Oracle Data Integrator
- Oracle GoldenGate
- Sqoop

Oracle Loader for Hadoop
- Parallel load, optimized for Hadoop
- Automatic load balancing
- Convert to Oracle format on Hadoop
- Save database CPU
- Load specific Hive partitions
- Kerberos authentication
- Load directly into In-Memory table
Input formats: text, Avro, Parquet, sequence files, Hive, log files, JSON, compressed files, and more

Oracle Loader for Hadoop Performance
- Extremely fast performance
- Sample numbers (on Oracle Engineered Systems): 4.4 TB/hour end-to-end (load + Hadoop process)
- Much higher than typical customer requirements
- Optimized for Oracle Big Data Appliance and Oracle Exadata: InfiniBand connectivity

Oracle SQL Connector for HDFS
[Diagram: text files, Hive tables, and compressed files in HDFS are read through OSCH into an external table]

    CREATE TABLE customer_address (
      ca_customer_id   NUMBER(10,0),
      ca_street_number CHAR(10),
      ca_state         CHAR(2),
      ca_zip           CHAR(10))
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY DEFAULT_DIR
      ACCESS PARAMETERS (
        PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
      )
      LOCATION ('addr1', 'addr2', 'addr3'))

- Parallel query and load
- Load into database or query in place
- Access text or Hive over text
- Access compressed data
- Access specific Hive partitions
- Kerberos authentication

Oracle SQL Connector for HDFS
- Includes tool to generate external table
- Performance on Engineered Systems: 15 TB/hour load time
- Query and load Oracle Data Pump files
  - Binary file in Oracle format
  - Uses fewer database CPU cycles during query/load

Oracle SQL Connector for HDFS works with multiple versions
Database versions (on any operating system*):
- 10.2.0.5 and greater
- 11.2.0.3 and greater
- 12c
*Oracle SQL Connector for HDFS requires Hadoop client to be supported on the operating system
Hadoop versions:
- Apache Hadoop 2.x, CDH 4.x (Cloudera), CDH 5.x (Cloudera): certified by Oracle
- HDP 1.3 (Hortonworks), HDP 2.1 (Hortonworks): certified by Hortonworks

Test Case

Data Generation
Data was generated using a tool:
- Built by the product development team
- Used for benchmarking
- Generates data within HDFS
- Random data
The data files were a combination of various data types such as int, float, varchar, date and timestamp.
Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Integrating Data: Oracle Databases and Hadoop
Things to consider while moving data from Hadoop to Oracle RDBMS:
- OSCH has consistently proven to be the best tool if you have to move massive amounts of text data. OSCH moved 14 TB in an hour on Engineered Systems.
- OLH is the best tool if you are dealing with a large number of partitioned tables on both source and target.
- Generally, aligning the number of files at the source with the number of partitions at the target and with the degree of parallelism (DOP) gives better performance.
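One way to see why matching the file count to the DOP helps is a deliberately simplified model: assume all files are the same size and each of the DOP parallel workers reads one file at a time. Neither assumption comes from the slides, and the function names are hypothetical.

```python
import math


def load_rounds(num_files, dop):
    """Rounds needed when each of `dop` workers reads one file per round."""
    return math.ceil(num_files / dop)


def utilization(num_files, dop):
    """Fraction of worker slots doing useful work across all rounds."""
    return num_files / (load_rounds(num_files, dop) * dop)


# Compare file counts against a fixed DOP of 8.
for files in (8, 10, 16):
    print(files, load_rounds(files, 8), utilization(files, 8))
```

Under this model, 10 files with a DOP of 8 take as many rounds as 16 files, with most workers idle in the second round. File counts that are a multiple of the DOP keep every worker busy.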

Oracle Loader for Hadoop vs. Oracle SQL Connector for HDFS

Oracle Data Integration for Big Data and Hadoop
Comprehensive data integration platform designed to work with all data
- Data Replication: continuous data staging into Hadoop
- Data Transformation: pushdown processing in Hadoop
- Data Federation: query Hadoop SQL via JDBC
- Data Quality: fix quality at the source or invoke machine learning in Hadoop
- Metadata Management: lineage and impact analysis with Hadoop
Products:
- Oracle GoldenGate (Data Replication): real-time staging
- Oracle Data Integrator (Data Transformation): pushdown data transformations, fast load, synchronization
- Enterprise Data Quality (Profile, Cleanse, Match and De-duplicate)
- Enterprise Metadata Management (Lineage, Impact Analysis and Data Provenance)
- Data Service Integrator (Data Federation)

Oracle Big Data SQL: Query All Data without Application Change or Data Conversion

Summary
Fast, easy integration of all data in your Big Data solution:
- Oracle Big Data Connectors
- Oracle Big Data SQL (on Oracle Engineered Systems)
- Oracle Data Integrator
- Oracle GoldenGate
- Apache Sqoop

Additional information
http://www.oracle.com/technetwork/database/bigdataappliance/oracle-bigdatalite-2104726.html
http://www.oracle.com/technetwork/database/bigdataappliance/overview/index.html