Day with Development Master Class: Big Data Management System
DW & Big Data Global Leaders Program
Jean-Pierre Dijcks, Big Data Product Management, Server Technologies

Part 1: Foundation and Architecture of a BDMS
Part 2: Streaming & Batch Data Ingest and Tooling

Storing Data in HDFS and Its Relation to Performance

Space usage vs. type complexity:
- Data types like JSON are popular, especially when exchanging data or when capturing messages.
- Simple JSON documents can be left in their full state.
- If the documents are deeply nested, it pays to flatten them upon ingest. The consequence is, of course, an expansion of the data size, but joins and typical analytics on Hadoop will perform better on simpler objects.
- In regulated industries, it pays to keep the original JSON as well as the decomposed structure. Be sure to compress the original and save it away in a source directory.
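As a minimal sketch of the flatten-on-ingest pattern, assuming the jq utility is available and that events.json and the HDFS paths are hypothetical names:

```bash
# Flatten nested JSON records to dotted top-level keys, load the flat copy
# into HDFS, and keep a compressed original in a source directory.
hdfs dfs -mkdir -p /data/source /data/flat

jq -c '[leaf_paths as $p | {(($p | map(tostring) | join("."))): getpath($p)}] | add' \
  events.json > events_flat.json
hdfs dfs -put events_flat.json /data/flat/

# Regulated industries: keep the original JSON too, compressed.
gzip -k events.json
hdfs dfs -put events.json.gz /data/source/
```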

General Classification
- Stream: Flume, GoldenGate, Kafka
- Batch, push: HDFS put, (S)FTP, NFS
- Batch, pull: wget/curl & HDFS put; wget/curl & (S)FTP/NFS

Batch Loading Data into HDFS: Pushing Data

Don't do this: HDFS in this case should replace any additional SAN or NAS filers.

Instead, try to do this: add either an FTP server or a Hadoop client to the source.

Major benefits from this simple change:
- Reduces the amount of NAS/SAN storage => cost savings
- Reduces complexity
- Reduces data proliferation (improved security)

Using Hadoop Client to Load Data

An HDFS put is issued from a Hadoop client installed on the source server, writing straight to the Big Data Appliance HDFS nodes.
- Enables direct HDFS writes without intermediate file staging on the local Linux file system
- Easy to scale: initiate concurrent puts for multiple files, and HDFS will leverage multiple target servers and ingest faster
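A minimal sketch of this pattern (paths are hypothetical): concurrent puts issued from the source server.

```bash
# Push files straight into HDFS from the source server, no intermediate
# staging. Backgrounding the puts lets HDFS ingest across multiple datanodes.
for f in /data/outbound/*.csv; do
  hdfs dfs -put "$f" /landing/sales/ &
done
wait   # block until all background puts have finished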

FTP-ing onto the Local Linux File System

Basic flow (first install an FTP server on the BDA node(s); some FTP clients can also write to WebHDFS directly):
1. FTP files onto the local Linux FS on the BDA (something like /u12)
2. Use HDFS put to load the data from the Linux FS into HDFS
3. Remove the files from the Linux FS
4. Repeat
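A minimal sketch of steps 1 through 4, assuming hypothetical paths and that an FTP server is already delivering files into the landing directory:

```bash
LANDING=/u12/landing      # local Linux FS on the BDA node (step 1 lands here)
HDFS_TARGET=/landing/ftp

# Steps 2 and 3: load each landed file into HDFS, then remove it locally.
for f in "$LANDING"/*; do
  [ -e "$f" ] || continue                     # nothing landed yet
  hdfs dfs -put "$f" "$HDFS_TARGET"/ && rm -f "$f"
done
# Step 4: repeat on a schedule, e.g. from cron.
```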

FTP: Managing Space for Linux and HDFS on Ingest Nodes

You cannot (today) de-allocate a few disks from HDFS on the BDA, so you should:
- Set a quota on how large HDFS can grow on the ingest nodes
- Set a quota at the Linux level to regulate space

Sizing depends on:
- The ingest and cleanup schedule
- The ingest size
- Peak ingest sizes
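On the HDFS side, a directory space quota is one way to enforce this; a minimal sketch (size and path are hypothetical):

```bash
# Cap how much the HDFS landing area can grow on the ingest path.
hdfs dfsadmin -setSpaceQuota 10t /landing

# Inspect the quota and current consumption.
hdfs dfs -count -q -h /landing
```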

FTP: High Availability
- Run multiple FTP servers on multiple BDA nodes
- Put a load balancer such as HAProxy (included with Oracle Linux) in front of them
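A minimal HAProxy sketch (host names are hypothetical; note that passive-mode FTP data ports need to be handled as well, which this sketch omits):

```bash
# Balance inbound FTP control connections across FTP servers on two BDA nodes.
cat > /etc/haproxy/haproxy.cfg <<'EOF'
listen ftp
    bind *:21
    mode tcp
    balance roundrobin
    server bdanode01 bdanode01:21 check
    server bdanode02 bdanode02:21 check
EOF
systemctl restart haproxy
```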

Batch Loading Data into HDFS: Pulling Data

Pulling Data with wget or curl and a Hadoop Client

Use wget or curl on the BDA side to initiate the data transfer, and pipe the result straight through to an HDFS put issued from the Hadoop client.
- Can use FTP/HTTP as well
- All observations from the previous section apply
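A minimal sketch (URL and paths are hypothetical); the "-" argument tells put to read from stdin, so nothing is staged on the local file system:

```bash
curl -sSf https://source.example.com/exports/events.csv.gz \
  | hdfs dfs -put - /landing/web/events.csv.gz
```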

Grabbing Data from Databases
- Oracle Big Data SQL: copy to the BDA, or table access to the BDA
- Sqoop: SQL/object-based extraction
- Oracle GoldenGate: change capture
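As one example of the Sqoop path, a minimal import sketch (connection details, table, and paths are hypothetical):

```bash
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username ETL_USER -P \
  --table SALES.ORDERS \
  --target-dir /landing/orcl/orders \
  --num-mappers 4
```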

1) Avoid any additional external staging systems, as these systems reduce scalability.
2) Opt for tools and methods that write directly into HDFS, such as HDFS put.

Moving Mainframe Data into HDFS: Batch Files

Using GoldenGate to Replicate from the Mainframe
- GoldenGate can replicate from the mainframe database
- Apply directly into HDFS or HBase

Mainframe Data

General assumption: any data collection on the mainframe needs to be non-intrusive, for security and cost (MIPS) reasons.
- Existing jobs typically generate files
- SyncSort is one of the leading mainframe tools
- FTP (via ETL tools) moves files from the mainframe to recipient systems

Using File Transfers to Move Data from the Mainframe
- SyncSort generates the files; then follow the push and pull mechanisms discussed earlier to move the data to the BDA HDFS nodes

Using File Transfers to Move Data from the Mainframe

Most mainframe files will be in EBCDIC format and need to be converted to ASCII:
1. Land on local disk (Linux FS)
2. Put the files into HDFS
3. Convert from EBCDIC to ASCII using standard tooling (e.g., SyncSort) on Hadoop
4. Optional: copy the ASCII file and compress it together with the original EBCDIC files
5. Archive the original file (together with the ASCII copy, if made)
6. Delete the original files from the Linux FS
7. Repeat
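A minimal sketch of this flow (paths are hypothetical). Real EBCDIC conversion should use proven tooling such as SyncSort on Hadoop, as the next slide advises; dd's conv=ascii appears here only as a local stand-in that works for simple fixed-width, single-byte data:

```bash
LANDING=/u12/mf
for f in "$LANDING"/*.ebc; do
  hdfs dfs -put "$f" /landing/mf/raw/              # step 2: raw EBCDIC into HDFS
  dd if="$f" of="${f%.ebc}.txt" conv=ascii         # step 3 stand-in: convert
  hdfs dfs -put "${f%.ebc}.txt" /landing/mf/ascii/ # keep the ASCII copy
  gzip "$f" && mv "$f.gz" /archive/mf/             # steps 4-5: compress, archive
  rm -f "${f%.ebc}.txt"                            # step 6: clean the Linux FS
done
```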

1) Keep the transfer software as simple as possible.
2) Move as much file processing as possible from the mainframe to the BDA, and use proven tools for EBCDIC-to-ASCII conversions.

Streaming Data: Product Approach

Various Tooling Options
- Apache Kafka seems to be a (new) favorite
- Oracle GoldenGate just added a big data option, enabling streaming from GoldenGate sources into HDFS and Hive, for example
- Oracle Event Processing offers a rich developer environment and low-latency stream processing
- See the respective documentation for details, usage, and restrictions

Note the distinction between transport and processing: OEP is an example of stream processing, whereas Kafka is stream transport.

What Should I Use Now That I Am Streaming Data?

Chances are you have no choice:
- Your sources are already publishing data onto a messaging bus
- Your organization already has a streaming system in place

Nevertheless, the following section attempts to clarify the question.

Apache Flume (NG)

Currently one of the most common tools, with many pre-built sources and sinks. Other interesting aspects:
- Scalable, with fan-in and fan-out capabilities
- Writes directly into HDFS
- Can evaluate simple in-stream actions
- Part of CDH and supported as such

Use it for streaming when:
- Simple actions need to be evaluated
- Reasonable latency is acceptable
- Scalability is key
- You are already using it for other data sources

Oracle Event Processing

Low latency, with an easy-to-use visual modeling environment and its own DSL, Continuous Query Language (CQL):
- Available for the data center as well as embedded deployment, enabling large fan-in setups for IoT-like systems
- Writes directly into HDFS as well as Oracle NoSQL Database
- Can evaluate complex in-stream actions, leveraging readings from NoSQL and Oracle Database, and can use Oracle Spatial, for example
- Focuses on very low latency and complex actions

Use it for streaming when:
- You need low latency, embedded deployment, and complex actions, expanding toward IoT
- You are looking for mature tooling and an easy-to-use DSL

Apache Kafka

Highly scalable messaging system (from LinkedIn):
- Pub-sub mechanism
- Distributed and highly resilient
- Highly scalable, even when serving a mix of batch and online consumers
- No action-evaluation capabilities (needs external tooling for this)

Use it for streaming when:
- You are looking for a scalable messaging system
- You are dealing with very high volumes
- You can code a number of things yourself when needed
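A minimal sketch with the stock console tools (topic name, broker, and ZooKeeper addresses are hypothetical; flag names vary across Kafka versions):

```bash
# Create a topic, publish one event, and read the stream back.
kafka-topics.sh --create --topic events --partitions 8 --replication-factor 3 \
  --zookeeper zk01:2181

echo '{"id":1,"type":"click"}' | kafka-console-producer.sh \
  --broker-list broker01:9092 --topic events

kafka-console-consumer.sh --zookeeper zk01:2181 --topic events --from-beginning
```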

Conclusion
- Use Flume for specific use cases such as rolling log files. Why? Flume has a lot of code available for dealing with a large number of log formats, and it writes directly into HDFS.
- Use OEP when you need event processing: complex rules applied across the spectrum, or embedded systems (standardize).
- Use Kafka for transport when you are looking for massive-scale queuing/streaming and you have the skills, or can acquire them.

Streaming Data into HDFS: Pushing Data

Flume: Streaming Logs to HDFS

A Log4j-based Flume client on the web server sends events to a Flume agent, whose HDFS sink writes into the Big Data Appliance HDFS nodes. Note that Flume enables simple event processing as well as direct movement into HDFS or other sinks.

Flume: Streaming Logs to HDFS (Flume Concepts)
- Client: captures and transmits events to the next hop
- Agent: agents can write to other agents through sinks
- Source: receives events and delivers them to one or more channels
- Channel: receives the event, which gets drained by sinks
- Sink: either finishes a flow (e.g., the HDFS sink) or transmits to the next agent
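A minimal single-agent sketch wiring these concepts together (agent name, port, and paths are hypothetical):

```bash
cat > agent.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = avro             # source: receives events from the client
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

a1.channels.c1.type = memory          # channel: buffers events for the sink

a1.sinks.k1.type = hdfs               # sink: the HDFS sink finishes the flow
a1.sinks.k1.hdfs.path = /landing/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
EOF

flume-ng agent --name a1 --conf ./ -f agent.conf
```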

Flume: Streaming Logs to HDFS (Splitting Streams / Multi-Consumer)

A single Flume source can feed multiple channels, each drained by its own HDFS sink, so the same data flows to both the production and DR HDFS clusters.

Standardize as much as possible on a single technology, for ease of management (see next topic).

Landing Streaming Data

Land in HDFS or NoSQL?

Driven by query requirements:
- Do I need to see individual transactions as they land?
- Do I need key-based access in real time?
- Can I wait for HDFS to write to disk?

A separate NoSQL store does complicate the architecture, so only add one if required.

Streaming: Some Example Architectures

OEP => NoSQL Database => Hadoop
- Embedded OEP runs on sensors, and OEP runs on gateway devices
- A NoSQL database catches the data and delivers models to OEP
- Data lands on the Big Data Appliance HDFS nodes

OEP => NoSQL => Hadoop
- OEP instances are not linked; each acts upon a partition of the inputs
- Add a Coherence distributed memory grid to enable data sharing between all OEP instances

Flume => Kafka => Hadoop
- Flume clients and agents publish into a Kafka cluster; a Flume HDFS sink drains it into the Big Data Appliance HDFS nodes

Future State? Kafka => Hadoop
- Kafka producers feed the Kafka cluster, and Kafka consumers write into the Big Data Appliance HDFS nodes

The tooling for streaming is in flux, but Kafka looks like it is going to stick around. When in doubt, look at vendor options, as they are often better documented and supported.

HDFS Data into Databases

From HDFS to Database
- Oracle Big Data SQL: enables transparent SQL access for the end user across BDA + Exadata (covered in the next section!)
- Big Data Connectors: Oracle SQL Connector for HDFS and Oracle Loader for Hadoop
- Sqoop

A Few Comments
- Sqoop is widely used, but also widely complained about. Handle it with care and know what you are doing.
- Big Data Connectors: better performance than Sqoop; the preferred option for Oracle Database loads.
- Oracle Data Integrator: when licensing Big Data Connectors on the Big Data Appliance, ODI is included as a restricted-use license. This applies when all transformations are done on the BDA (none on the Oracle Database, for example).
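For completeness, a minimal Sqoop export sketch going the other way, from HDFS into an Oracle table (connection details, table, and paths are hypothetical); the Big Data Connectors remain the better-performing option where licensed:

```bash
sqoop export \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username ETL_USER -P \
  --table SALES.DAILY_AGG \
  --export-dir /output/daily_agg \
  --num-mappers 4
```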

Use (ETL) tools where you can, as they simplify implementation and enable you to shift implementation paradigms more quickly.
