Hadoop for Oracle database professionals. Alex Gorbachev, Calgary, AB, September 2013



Alex Gorbachev. Chief Technology Officer at Pythian. Blogger. Cloudera Champion of Big Data. OakTable Network member. Oracle ACE Director. Founder of BattleAgainstAnyGuess.com. Founder of Sydney Oracle Meetup. IOUG Director of Communities. EVP, Ottawa Oracle User Group.

Who is Pythian? 15 years of data infrastructure management consulting. 170+ top brands. 6000+ databases under management. Over 200 DBAs in 26 countries, in the top 5% of the DBA workforce: 9 Oracle ACEs, 2 Microsoft MVPs. Oracle, Microsoft, MySQL, Netezza, Hadoop, MongoDB, Oracle Apps, enterprise infrastructure.

When to Engage Pythian? LOVE YOUR DATA. [Chart: strategic upside value of data vs. the impact of an incident (data loss, security, human error, etc.), by tier: Tier 3 data, local retailer; Tier 2 data, e-commerce; Tier 1 data, health care.]

Agenda: What is Big Data? What is Hadoop? Big Data use cases. Practical examples. Hadoop in the DW. How to get started.

What is Big Data?

Doesn't Matter. We are here to discuss data architecture and use cases, not to define market segments.

What Does Matter? Some kinds of data are a bad fit for an RDBMS. Some problems are a bad fit for an RDBMS. We can call them BIG if you want. Data warehouses have always been BIG.

Given enough skill and money, Oracle can do anything. Let's talk about cost-efficient solutions.

When Does an RDBMS Make No Sense? Storing images and video. Processing images and video. Storing and processing other large files (PDFs, Excel files). Processing large blocks of natural-language text (blog posts, job ads, product descriptions). Processing semi-structured data (CSV, JSON, XML, log files). Sensor data.

When Does an RDBMS Make No Sense? Ad-hoc, exploratory analytics. Integrating data from external sources. Data cleanup tasks. Very advanced analytics (machine learning).

New Data Sources: blog posts, social media, images, videos, machine logs, sensors, location services, mobile devices. They all have large potential value, but they are an awkward fit for traditional data warehouses.

Big Problems with Big Data. It is: unstructured, unprocessed, un-aggregated, un-filtered, repetitive, low quality, and generally messy. Oh, and there is a lot of it.

Technical Challenges: storage capacity, storage throughput, pipeline throughput, processing power, parallel processing, system integration, data analysis. What is needed: scalable storage, massively parallel processing, ready-to-use tools.

Big Data Solutions come in two families. One: real-time transactions at very high scale, always available and distributed, relaxing the ACID rules (Atomicity, Consistency, Isolation, Durability); example: eventual consistency in Cassandra. Two: analytics and batch-like workloads on very large, often unstructured volumes; massively scalable and throughput-oriented, sacrificing efficiency for scale. Hadoop is the most industry-accepted standard tool here.

What is Hadoop?

Hadoop Principles: bring code to data; share nothing.

Hadoop in a Nutshell: a replicated, distributed big-data file system, plus MapReduce, a framework for writing massively parallel jobs.

HDFS architecture, simplified view. Files are split into large blocks, and each block is replicated on write. Files can only be created and deleted, by one client at a time. Uploading new data? Create a new file. Updating data? Recreate the file (append is supported in recent versions). There are no concurrent writes to a file. Clients transfer blocks directly to and from data nodes. Data nodes use cheap local disks, and local reads are efficient.
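To make the splitting concrete, here is a minimal Python sketch of the block arithmetic. HDFS itself is Java; the 128 MB block size and 3x replication factor are assumed common defaults, used here only to illustrate the math:

```python
# Sketch: how HDFS conceptually splits a file into blocks and replicates them.
BLOCK_SIZE = 128 * 1024**2  # assumed 128 MB block size
REPLICATION = 3             # assumed default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, rem = divmod(file_size, block_size)
    return [block_size] * full + ([rem] if rem else [])

blocks = split_into_blocks(300 * 1024**2)  # a 300 MB file
print(len(blocks))                # 3 blocks: 128 MB + 128 MB + 44 MB
print(len(blocks) * REPLICATION)  # 9 block replicas stored cluster-wide
```

Note the last block is allowed to be short: a file only consumes the bytes it actually has, plus replication.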

HDFS design principles

MapReduce example: histogram calculation
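The histogram flow can be sketched in a few lines: mappers emit (bucket, 1) pairs, the shuffle groups pairs by bucket, and reducers sum the counts. A toy single-process Python sketch of that flow (the bucket width of 10 is an arbitrary choice, not from the slide):

```python
from collections import defaultdict

def mapper(record):
    # Emit (bucket, 1) for each input value; buckets are ranges of width 10.
    value = int(record)
    yield (value // 10 * 10, 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups

def reducer(key, values):
    # Sum the counts for one bucket.
    return key, sum(values)

data = ["3", "7", "12", "15", "18", "27"]
mapped = [pair for rec in data for pair in mapper(rec)]
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result)  # {0: 2, 10: 3, 20: 1}
```

In a real cluster the mappers run in parallel on the nodes holding the data blocks, and the shuffle moves pairs across the network; that network transfer is exactly the "lots of traffic on shuffle" pitfall noted below.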

MapReduce pros and cons. Advantages: very simple; flexible; highly scalable; a good fit for HDFS (mappers read locally); fault tolerant. Pitfalls: low efficiency; lots of intermediate data; lots of network traffic on shuffle; complex manipulation requires a pipeline of multiple jobs; no high-level language; only mappers leverage local reads on HDFS.

Exadata SmartScan vs. MapReduce

Main components of the Hadoop ecosystem. Hive: HiveQL is a SQL-like query language; generates MapReduce jobs. Pig: a data-set manipulation language (like creating your own query execution plan); generates MapReduce jobs. Mahout: machine learning libraries; generates MapReduce jobs. Oozie: workflow scheduler services. Sqoop: transfers data between Hadoop and relational databases.

MapReduce is SLOW! It gets its speed through massive parallelization.

Non-MapReduce processing on Hadoop. HBase: a column-oriented key-value store (NoSQL). SQL without MapReduce: Impala (Cloudera), Drill (MapR), Phoenix (Salesforce.com), Hadapt (commercial). Shark: Spark in-memory analytics on Hadoop. Platfora (commercial): in-memory analytics plus a visualization and reporting tool.

Hadoop Benefits: a reliable solution built on unreliable hardware; designed for large files; load data first, structure it later; designed to maximize the throughput of large scans; designed to leverage parallelism; designed to scale; a flexible development platform; a solution ecosystem.

Hadoop Limitations. Hadoop is scalable but not fast. Some assembly required; batteries not included; instrumentation not included either. A DIY mindset (remember MySQL?). Simplistic security models. Commercial distributions are not free.

How much does it cost? About $300K DIY on SuperMicro: 100 data nodes, 2 name nodes, 3 racks, 800 Sandy Bridge CPU cores, 6.4 TB RAM, 600 x 2 TB disks. That is 1.2 PB of raw disk capacity and 400 TB usable (triple mirror). Open-source software, maybe a commercial distribution.
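The capacity figures on this slide can be checked with simple arithmetic; a quick sketch using the slide's own numbers (the per-node breakdown assumes the cores and RAM are spread evenly across the data nodes):

```python
# Back-of-the-envelope check of the cluster spec above.
data_nodes = 100
disks = 600
disk_tb = 2
raw_tb = disks * disk_tb            # 600 x 2 TB = 1200 TB = 1.2 PB raw
replication = 3                     # "triple mirror": every block stored 3 times
usable_tb = raw_tb // replication   # 1200 / 3 = 400 TB usable
cores_per_node = 800 // data_nodes  # 8 CPU cores per data node
print(raw_tb, usable_tb, cores_per_node)  # 1200 400 8
```

The factor-of-three gap between raw and usable capacity is the price of the replication that makes cheap, unreliable disks acceptable.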

Hadoop vs. Relational Database
Hadoop (the "cool")          | Relational database (the "old")
Load first, structure later  | Structure first, load later
Cheap hardware               | Enterprise-grade hardware
DIY                          | Repeatable solutions
Flexible data store          | Efficient data store
Effectiveness via scale      | Effectiveness via efficiency
Petabytes                    | Terabytes
100s-1000s of cluster nodes  | Dozens of nodes (maybe)

Big Data Use Cases


Use Cases for Big Data. Top-line contributors: analyzing customer behavior; optimizing ad placements; customized promotions, etc.; recommendation systems (Netflix, Pandora, Amazon); new products and services (Prismatic, smart home). Bottom-line contributors: cheap archive storage; an ETL-layer transformation engine and data cleansing.

Typical initial use cases for Hadoop in modern enterprise IT. A transformation engine (part of ETL): scales easily, inexpensive processing capacity, any data source and destination. A data landfill: stop throwing away any data; don't know how to use the data today? Maybe tomorrow you will. Hadoop is very inexpensive, yet very reliable.

Advanced: a Data Science Platform. A data warehouse is good when the questions are known and the data domain and structure are defined. Hadoop is great for seeking new meaning in data and new types of insights: unique information parsing and interpretation, and a huge variety of data sources and domains. When new insights are found and new structure is defined, Hadoop often takes the place of the ETL engine, and the newly structured information is then loaded into more traditional data warehouses (still, today).

Practical Hadoop Examples

Pythian Internal Hadoop Use: OCR of screen video captured by Pythian's privileged-access surveillance system. Input: raw frames from video capture. One MapReduce job runs OCR on the frames and produces text. Another MapReduce job identifies text changes from frame to frame and produces a text stream with a timestamp of when each text was on the screen. Other MapReduce jobs mine the text (and keystrokes) for insights: credit card patterns, sensitive commands (like DROP TABLE), root access, unusual activity patterns. The results are merged with monitoring and documentation systems.

RFID sensors in retail. Hundreds of stores; each retail location has tens of thousands of items; each item has an RFID tag; the position of each item is detected every few seconds. Having access to all this information, how can we optimize retail operations and sales? Item placement analytics, A/B testing, item correlations, etc.

Smart utility metering. Data sources: meters, past weather data, weather forecasts, and unusual combinations such as car traffic data and population data. Goals: optimize utility distribution systems, predict spikes in utilization, detect leaks, and predict emergencies.

Hadoop for Oracle DBAs? An alert.log repository, a listener.log repository, a Statspack/AWR/ASH repository, a trace repository, a DB audit repository, a web logs repository, a SAR repository, a SQL and execution plans repository, database job execution logs.

Hadoop in the Data Warehouse

ETL for Unstructured Data: logs (web servers, app servers, clickstreams) flow through Flume into Hadoop for cleanup, aggregation, and long-term storage, then into the DWH for BI and batch reports.
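The cleanup step in such a pipeline is often just a streaming-style map over raw log lines. A minimal illustrative sketch; the log format, regex, and field names here are assumptions for illustration, not something from the talk:

```python
import re

# Assumed Apache-style access log line; the pattern and fields are illustrative.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3})'
)

def clean(line):
    """Map step: keep only well-formed lines, emit tab-separated fields."""
    m = LOG_RE.match(line)
    if m is None:
        return None  # drop malformed input instead of failing the whole job
    return "\t".join(m.group("ip", "ts", "req", "status"))

line = '10.0.0.1 - - [01/Sep/2013:10:00:00 +0000] "GET / HTTP/1.1" 200'
print(clean(line))  # 10.0.0.1<TAB>01/Sep/2013:10:00:00 +0000<TAB>GET / HTTP/1.1<TAB>200
```

Silently dropping malformed lines (rather than raising) is the usual choice in this setting: at this scale some fraction of the input is always garbage, and one bad record should not kill a multi-hour job.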

ETL for Structured Data: OLTP databases (Oracle, MySQL, Informix) are extracted via Sqoop, Perl, etc. into Hadoop for transformation, aggregation, and long-term storage, then loaded into the DWH for BI and batch reports.

Bring the World into Your Datacenter

Rare Historical Report

Find Needle in Haystack

Getting started with Hadoop: your own Hadoop playground

Hadoop in the cloud. Amazon Elastic MapReduce: Amazon has automated spinning Hadoop clusters up and down on EC2 nodes, using S3 as long-term data storage instead of HDFS; remember that data locality is lost with S3. Whirr: deploy on Amazon EC2 or Rackspace Cloud Servers, http://whirr.apache.org. Deploy manually: example at http://www.pythian.com/blog/impala-ec2-tips-and-trap/. Joyent, the new kid on the block: http://joyent.com/products/joyent-cloud/features/hadoop

Your own Hadoop cluster. A VM on your laptop: Cloudera CDH4 VM, https://ccp.cloudera.com/display/support/demo+vms; Hortonworks Sandbox, http://hortonworks.com/products/hortonworks-sandbox. Anything goes: reuse decommissioned hardware at work, reuse old desktops. The cheapest layer-2 switch will do; good switching capacity matters, but there is no need for fancy features (VLANs, etc.).

Controversy: your data is NOT as BIG as you think.

Thank you & Q&A. To contact us: sales@pythian.com, hr@pythian.com, gorbachev@pythian.com, 1-866-PYTHIAN. To follow us: http://www.pythian.com/blog/, http://www.facebook.com/pages/the-pythian-group/, http://twitter.com/pythian, http://twitter.com/alexgorbachev, http://www.linkedin.com/company/pythian