Oracle and/or Hadoop, and what you need to know
Jean-Pierre Dijcks, Data Warehouse Product Management

Agenda
- Business Context
- An overview of Hadoop and/or MapReduce
- Choices, choices, choices
- Q&A

Business Drivers, Changing IT: More
- More data
- More users
- More analysis
- More uptime

Business Drivers, Changing IT: More, Faster
- Performance
- Startup
- Development
- Time to Market

Business Drivers, Changing IT: More, Faster, Cheaper
- Cheaper hardware
- Fewer staff
- Less power
- Less cooling

Some Reactions to these Changes
- Open Source Software
- Grid Computing on Commodity Hardware
- Virtualization
- The emergence of the Cloud
- Democratization of IT
- Always-on Systems
- Democratization of BI
- Operational Data Warehousing
- Vertical Solutions
- Etc.

The Cloud

Some Contradictions
We want more uptime, better performance, less hardware and fewer staff. Yet:
- Open Source is cheap but less robust
- Cloud and Virtualization are slower
- MPP clusters (Hadoop) need more hardware
- Hadoop requires lots of programming
Choose wisely.

What is Hadoop?

Hadoop Architecture
Hadoop is a shared-nothing compute architecture that:
- Is open source (as opposed to Google's implementation)
- Is a data processing architecture
- Processes data in parallel to achieve its performance
- Runs on very large clusters (100s to 1000s of nodes) of cheap commodity hardware
- Automatically deals with node failures and redeploys data and programs as needed
- Some say it is very cool
Cloud and Hadoop: Hadoop can run in a (private) cloud.

High-level Hadoop Architecture
Components:
- Hadoop client: your terminal into the Hadoop cluster. It initiates processing; no actual code runs here.
- NameNode: manages the metadata and access control. A single node, often made redundant with exactly one secondary NameNode.
- JobTracker: hands out the tasks to the slaves (the query coordinator). The slaves are called TaskTrackers.
- DataNodes: store the data and do the processing. Data is stored redundantly across these data nodes.
- Hadoop Distributed File System (HDFS): stores input and output data.

A typical HDFS cluster [diagram]
- NameNode: holds metadata about where the data lives.
- Secondary NameNode: passive; downloads periodic checkpoints from the NameNode (no automatic failover).
- Client / Program: communicates with the NameNode about the location of data (where to read from and write to), then interacts directly with the DataNodes to read and write the data.
- JobTracker: the query coordinator.
- DataNodes / TaskTrackers: hold the active and passive (replicated) data.

Loading Data (simplified) [diagram]
1. The Client / Program requests data placement from the NameNode.
2. The client receives the data placement info and buffers the data.
3. The client writes each data chunk to both the primary and the secondary node of the cluster.
4. The DataNodes confirm both writes.
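As a concrete illustration of this write path, here is a minimal sketch using Hadoop's standard Java FileSystem API. The NameNode address and file path are illustrative assumptions, not from the slides; the client code only sees the API, while the placement request and replicated chunk writes described above happen underneath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; normally picked up from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode for block placement; the stream then
        // writes each chunk to the chosen DataNodes, which replicate it and
        // acknowledge the writes (steps 1-4 above).
        FSDataOutputStream out = fs.create(new Path("/data/input/clicks.log"));
        out.writeBytes("a sample record\n");
        out.close();
        fs.close();
    }
}
```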

Querying Data (simplified) [diagram]
- The Client / Program asks the NameNode, which holds the metadata about where the data lives, for the data location.
- The JobTracker parcels out assignments to the DataNodes / TaskTrackers, which execute the mappers and reducers.
- The JobTracker aggregates the results and returns the aggregated results to the client.

What is the typical use case?
The common use cases cited are things like:
- Generating inverted indexes (text searching)
- Analysis of non-relational data (log files, web clicks etc.) at extreme volumes
- Some types of ETL processing
What it does not do:
- It is not a database (neither relational nor columnar nor OLAP)
- It is not good at real-time or short-running jobs
- It does not deal well with real-time, or even frequent/regular, updates to the data on the cluster
- It is not very easy to use (developers only, please), as it is pure coding and debugging (look at things like Cascading etc.)

MapReduce Programs
"MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers."
MapReduce is:
- The program building blocks for a Hadoop cluster: reducers consume data provided by mappers
- Many mappers and reducers running in parallel
- Written in many languages (Perl, Python, Java etc.)

MapReduce Example
A very common example used to illustrate MapReduce is a word count: in a chunk of text, count all the occurrences of words, either specific words or all words. This functionality is written as a program executed on the cluster, delivering name-value pairs with the total word counts as the result. The next two slides walk through the data flow; a minimal Java version is sketched after them.

MapReduce Example [diagram]
The Input Reader splits the text "The cloud is water vapor. But is water vapor useful? But it is!" across two map processes.
- Map process 1 emits: (the, 1) (cloud, 1) (is, 1) (water, 1) (vapor, 1)
- Map process 2 emits: (but, 1) (is, 1) (water, 1) (vapor, 1) (useful, 1) (but, 1) (it, 1) (is, 1)
Partition, Compare, Redistribute then sorts and groups the pairs by key and sends each group to a reducer:
- Partition 1: (the, 1) (cloud, 1) (is, 1) (is, 1) (is, 1) (but, 1) (but, 1)
- Partition 2: (water, 1) (water, 1) (vapor, 1) (vapor, 1) (it, 1) (useful, 1)
Source: http://en.wikipedia.org/wiki/mapreduce

MapReduce Example [diagram], continued
- Reducer 1 receives (the, 1) (cloud, 1) (is, 1) (is, 1) (is, 1) (but, 1) (but, 1) and emits: (the, 1) (cloud, 1) (is, 3) (but, 2)
- Reducer 2 receives (water, 1) (water, 1) (vapor, 1) (vapor, 1) (it, 1) (useful, 1) and emits: (water, 2) (vapor, 2) (it, 1) (useful, 1)
Consolidate and Write produces the final result: (the, 1) (cloud, 1) (water, 2) (is, 3) (but, 2) (vapor, 2) (it, 1) (useful, 1)
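For reference, this is what the word count above looks like in Hadoop's Java MapReduce API. It is the well-known introductory example in minimal form; the input and output paths are illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token, as on the slides above.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: the framework has already grouped by word; sum the 1s per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The partition/compare/redistribute step between map and reduce is done entirely by the framework; the programmer only writes the two functions.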

In the Eye of the Beholder
There is a lot of confusion about what Hadoop is or does in detail, so when Hadoop comes up there is a mismatch between its perceived capabilities and its real capabilities:
- Hadoop is talked about as a simple solution
- Hadoop is talked about as being low cost
- A data warehouse has a lot of data, so Hadoop should work
- Massively parallel capabilities will solve my performance problems
- Everyone uses Hadoop

Myths and Realities
Hadoop is talked about as a simple solution:
- But you need expert programmers to make anything work
- It is Do-It-Yourself parallel computing (no optimizer, no statistics, no smarts)
- It only works in a development environment with a few developers and a small set of known problems
Hadoop is talked about as being low cost:
- Yes, it is open source, with all the pros and cons that brings
- And don't forget the cost of a savvy developer, or six
A data warehouse has a lot of data, so Hadoop should work:
- Maybe, but probably not. Hadoop does not deal well with continuous updates, ad-hoc queries, many concurrent users or BI tools
- Only programmers can get real value out of Hadoop, not your average business analyst

Myths and Realities
Massively Parallel Processing will solve my performance problems:
- Well, maybe, or maybe not
- The appeal of Hadoop is the ease of scaling to thousands of nodes, not raw performance
- In fact, benchmarks have shown a relational database to be faster than Hadoop
- Not all problems benefit from the capabilities of the Hadoop system; Hadoop does solve some problems for some companies
Everyone uses Hadoop:
- Well, mostly internet-focused businesses, and maybe a few hundred all in all
- And yes, they use it for specific static workloads like reverse indexing (internet search engines) and pre-processing of data
- And do you have the programmers in-house that they have?

Myths and Realities
But:
- If you have the programmers / the knowledge
- If you have the large cluster (or can live with a cloud solution)
You can create a very beneficial solution to a Big Data problem as part of your infrastructure.

Oracle and/or Hadoop
- Running MapReduce within an Oracle Database is very easy
- Using Hadoop and then feeding the data to Oracle for further analysis is more common, and quite easy
- Integrating the two (e.g. with a single driving site), leveraging both frameworks, is doable but more involved

Using Oracle instead of Hadoop: Running MapReduce within the Database
[Diagram: inside Oracle Database 11g, data flows from a source table through map and reduce code into a result table; both map and reduce run as code within the database.]
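A hedged sketch of how this can look from a client: the map and reduce steps are implemented as PL/SQL pipelined table functions and invoked with plain SQL over JDBC. The wordcount_pkg package and the docs table are hypothetical names, in the spirit of the in-database map-reduce blog post linked at the end, not Oracle built-ins:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InDatabaseMapReduce {
    public static void main(String[] args) throws Exception {
        // Connection details are illustrative.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/orcl", "scott", "tiger");
        Statement stmt = conn.createStatement();

        // Hypothetical pipelined table functions: map() tokenizes rows of the
        // docs table into words, reduce() aggregates the counts. Both can run
        // in parallel inside the database, mirroring mappers and reducers.
        ResultSet rs = stmt.executeQuery(
            "SELECT word, total " +
            "FROM TABLE(wordcount_pkg.reduce(" +
            "       CURSOR(SELECT word FROM TABLE(" +
            "         wordcount_pkg.map(CURSOR(SELECT line FROM docs))))))");

        while (rs.next()) {
            System.out.println(rs.getString("word") + " = " + rs.getLong("total"));
        }
        conn.close();
    }
}
```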

Using Oracle instead of Hadoop: Running MapReduce within the Database
[Diagram: HDFS data files (part 1 ... part n) are exposed through a FUSE mount as an external table in Oracle Database 11g; map and reduce steps inside the database load the results into a regular table.]

Using Oracle Next to Hadoop: the RDBMS as a Target for Hadoop Processing
[Diagram: HDFS output files (part 1 ... part n) are exposed through a FUSE mount as an external table in Oracle Database 11g, where the data is joined, filtered and transformed using the Oracle database.]
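A sketch of that pattern, under stated assumptions: HDFS is FUSE-mounted on the database host, an Oracle directory object (here called hdfs_out) points at the mount, and an external table exposes the tab-separated MapReduce output files to SQL. All names are illustrative:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HdfsExternalTable {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/orcl", "scott", "tiger");
        Statement stmt = conn.createStatement();

        // Assumes HDFS is FUSE-mounted at /mnt/hdfs and a directory object
        // exists: CREATE DIRECTORY hdfs_out AS '/mnt/hdfs/output';
        // The '\t' below becomes a literal tab character, Hadoop's default
        // key/value separator in text output.
        stmt.execute(
            "CREATE TABLE hadoop_results (word VARCHAR2(100), total NUMBER) " +
            "ORGANIZATION EXTERNAL ( " +
            "  TYPE ORACLE_LOADER " +
            "  DEFAULT DIRECTORY hdfs_out " +
            "  ACCESS PARAMETERS ( " +
            "    RECORDS DELIMITED BY NEWLINE " +
            "    FIELDS TERMINATED BY '\t' " +
            "  ) " +
            "  LOCATION ('part-00000', 'part-00001') " +
            ") REJECT LIMIT UNLIMITED");

        // The MapReduce output can now be joined, filtered and transformed
        // with ordinary SQL alongside the warehouse tables.
        stmt.execute(
            "CREATE TABLE word_totals AS " +
            "SELECT word, SUM(total) AS total FROM hadoop_results GROUP BY word");
        conn.close();
    }
}
```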

Running Oracle with Hadoop: Integrated Solution
[Diagram: a controller table function in Oracle Database 11g directs the Hadoop jobs; the NameNode holds the metadata, Hadoop produces its results into a queue, and parallel table function invocations read and process the queued results inside the database after Hadoop produces them.]

Starting the Processing
[Diagram] (1) The table function invocations start from the query coordinator (QC); (2) a job monitor is set up; (3) the monitor asynchronously triggers the launcher; (4) the launcher synchronously starts the Hadoop mappers; (5) the mappers en-queue their results into a queue; (6) the table function invocations de-queue the results.

Monitoring the Hadoop Side
[Diagram] While the mappers keep en-queuing results and the table function invocations keep de-queuing them (6), the job monitor polls: (7) it checks the launcher asynchronously and (8) reports status back to the table function invocations.

Processing Stops
[Diagram] (9) When the Hadoop job completes and the queue is drained, the job monitor asynchronously signals the table function invocations to stop.
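The monitor/launcher split in these three slides maps naturally onto Hadoop's own Java job API, where submit() returns immediately while waitForCompletion() blocks. A minimal sketch of the asynchronous launch and the polling loop follows; job configuration is elided, and this illustrates the pattern only, not Oracle's actual implementation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LauncherSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "directed-by-table-function");
        // ... mapper, reducer, input and output paths as in the WordCount example ...

        job.submit(); // asynchronous launch: returns as soon as the job is handed off

        // The "job monitor" side: poll the cluster while the mappers fill the queue.
        while (!job.isComplete()) {
            Thread.sleep(5000);
            System.out.printf("map %.0f%% reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
        }
        // Once the job is complete and the queue is drained, the consuming
        // table function invocations can stop (step 9 on the slide).
    }
}
```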

Do you need Hadoop? Some Considerations
- Think about the data volume you need to work with
- What kind of data are you working with? Structured? Unstructured or semi-structured?
- Think about the application of that data (e.g. what workload you are running)
- Who is the audience?
- Do you need to safeguard every bit of this information?

Size Matters
Problems with the current data warehouse platform (% of respondents):
- Poor query response: 45%
- Can't support advanced analytics: 40%
- Inadequate data load speed: 39%
- Can't scale to large data volumes: 37%
- Cost of scaling up is too expensive: 33%
- Poorly suited to real-time or on-demand workloads: 29%
- Current platform is a legacy we must phase out: 23%
- Can't support the data modeling we need: 23%
- We need a platform that supports mixed workloads: 21%
Source: TDWI Next Generation Data Warehouse Platforms Report, 2009

Size Matters
Data warehouse size (% of respondents; each pair as in the original chart legend: In 3 Years, Today):
- More than 10 TB: 17%, 34%
- 3-10 TB: 19%, 25%
- 1-3 TB: 18%, 21%
- 500 GB-1 TB: 12%, 20%
- Less than 500 GB: 5%, 21%
Source: TDWI Next Generation Data Warehouse Platforms Report, 2009

Workload Matters
Problems with the current data warehouse platform (% of respondents):
- Poor query response: 45%
- Can't support advanced analytics: 40%
- Inadequate data load speed: 39%
- Can't scale to large data volumes: 37%
- Cost of scaling up is too expensive: 33%
- Poorly suited to real-time or on-demand workloads: 29%
- Current platform is a legacy we must phase out: 23%
- Can't support the data modeling we need: 23%
- We need a platform that supports mixed workloads: 21%
Source: TDWI Next Generation Data Warehouse Platforms Report, 2009

Do you need Hadoop? Part 1: Yes, as a Data Processing Engine
- If you have a lot (a couple of hundred TBs) of unstructured data to sift through, you should probably investigate it as a processing engine
- If you have very processing-intensive workloads on large data volumes:
  - Run those ETL-like processes every so often on new data
  - Process that data and load the valuable outputs into an RDBMS
  - Use the RDBMS to share the results, combined with other data, with the users

Do you need Hadoop? Part 1: Yes, as a Data Processing Engine
[Diagram: the Data Processing Stage (HDFS output files, part 1 ... part n) feeds, via a FUSE mount and an external table, the Data Warehousing Stage, where the data is joined, filtered and transformed using Oracle Database 11g.]

Do you need Hadoop? Part 2: Not really
- The overall size is somewhere around 1-10 TB
- Your data loads are done with flat files
- You need to pre-process those files before loading them
- The aggregate size of these files is manageable:
  - Your current Perl scripts work well
  - You do not see bottlenecks in processing the data
- The work you are doing is relatively simple:
  - Basic string manipulations
  - Some re-coding

Conclusion: Design a Solution for YOUR Problem
- Understand your needs and your target audience
- Choose the appropriate solution for the problem
- Don't get pigeonholed into a single train of thought

Need More Information?
Read this (or just Google around):
- http://hadoop.apache.org
- http://database.cs.brown.edu/sigmod09/benchmarkssigmod09.pdf
- http://www.cs.brandeis.edu/~cs147a/lab/hadoopcluster/
- http://blogs.oracle.com/datawarehousing/2010/01/integrating_hadoop_data_with_o.html
- http://blogs.oracle.com/datawarehousing/2009/10/indatabase_map-reduce.html

Questions