Practical Hadoop by Example



Similar documents
Hadoop. for relational database professionals. Alex Gorbachev 17-May-2013 Atlanta, GA

Hadoop. for Oracle database professionals. Alex Gorbachev Calgary, AB September 2013

Constructing a Data Lake: Hadoop and Oracle Database United!

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

How To Scale Out Of A Nosql Database

Big Data With Hadoop

Hadoop Ecosystem B Y R A H I M A.

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop and Map-Reduce. Swati Gore

Oracle Big Data SQL Technical Update

A Scalable Data Transformation Framework using the Hadoop Ecosystem

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Implement Hadoop jobs to extract business value from large and varied data sets

Apache Hadoop: Past, Present, and Future

Big Data Course Highlights

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

NoSQL for SQL Professionals William McKnight

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc.

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Moving From Hadoop to Spark

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Hadoop Job Oriented Training Agenda

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

HDP Hadoop From concept to deployment.

Real Time Big Data Processing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

The Future of Data Management

Hadoop implementation of MapReduce computational model. Ján Vaňo

Data processing goes big

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Hadoop & Spark Using Amazon EMR

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Self-service BI for big data applications using Apache Drill

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

How To Handle Big Data With A Data Scientist

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

The Future of Data Management with Hadoop and the Enterprise Data Hub

MapR: Best Solution for Customer Success

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Qsoft Inc

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

A Brief Outline on Bigdata Hadoop

Self-service BI for big data applications using Apache Drill

So What s the Big Deal?

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Google Bing Daytona Microsoft Research

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Dominik Wagenknecht Accenture

ITG Software Engineering

Hadoop IST 734 SS CHUNG

HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM APACHE HADOOP WORKSHOP

Using distributed technologies to analyze Big Data

Luncheon Webinar Series May 13, 2013

Big Data: Tools and Technologies in Big Data

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

<Insert Picture Here> Big Data

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Introduction to Apache Cassandra

Large scale processing using Hadoop. Ján Vaňo

HDP Enabling the Modern Data Architecture

Building Your Big Data Team

Tap into Hadoop and Other No SQL Sources

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Integrating Big Data into the Computing Curricula

Internals of Hadoop Application Framework and Distributed File System

Peers Techno log ies Pv t. L td. HADOOP

Big + Fast + Safe + Simple = Lowest Technical Risk

CitusDB Architecture for Real-Time Big Data

Open source large scale distributed data management with Google s MapReduce and Bigtable

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Big Data in a Relational World Presented by: Kerry Osborne JPMorgan Chase December, 2012

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Native Connectivity to Big Data Sources in MSTR 10

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Open source Google-style large scale data analysis with Hadoop

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Big Data Management and Security

Building Scalable Big Data Pipelines

Apache Hadoop in the Enterprise. Dr. Amr Awadallah,

Big Data Infrastructure at Spotify

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

Workshop on Hadoop with Big Data

Splice Machine: SQL-on-Hadoop Evaluation Guide

Introduction to Big Data Training

Hadoop & its Usage at Facebook

Testing Big data is one of the biggest

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Transcription:

Practical Hadoop by Example for relational database professioanals Alex Gorbachev 12-Mar-2013 New York, NY

Alex Gorbachev Chief Technology Officer at Pythian Blogger OakTable Network member Oracle ACE Director Founder of BattleAgainstAnyGuess.com Founder of Sydney Oracle Meetup IOUG Director of Communities EVP, Ottawa Oracle User Group 2

Why Companies Trust Pythian Recognized Leader: Global industry-leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL and SQL Server Work with over 150 multinational companies such as Forbes.com, Fox Interactive media, and MDS Inc. to help manage their complex IT deployments Expertise: One of the world s largest concentrations of dedicated, full-time DBA expertise. Global Reach & Scalability: 24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response 3 3

Agenda What is Big Data? What is Hadoop? Hadoop use cases Moving data in and out of Hadoop Avoiding major pitfalls

What is Big Data?

Doesn t Matter. We are here to discuss data architecture and use cases. Not define market segments.

What Does Matter? Some data types are a bad fit for RDBMS. Some problems are a bad fit for RDBMS. We can call them BIG if you want. Data Warehouses have always been BIG.

Given enough skill and money Oracle can do anything. Lets talk about efficient solutions.

When RDBMS Makes no Sense? Storing images and video Processing images and video Storing and processing other large files PDFs, Excel files Processing large blocks of natural language text Blog posts, job ads, product descriptions Processing semi-structured data CSV, JSON, XML, log files Sensor data

When RDBMS Makes no Sense? Ad-hoc, exploratory analytics Integrating data from external sources Data cleanup tasks Very advanced analytics (machine learning)

New Data Sources Blog posts Social media Images Videos Logs from web applications Sensors They all have large potential value But they are awkward fit for traditional data warehouses

Big Problems with Big Data It is: Unstructured Unprocessed Un-aggregated Un-filtered Repetitive Low quality And generally messy. Oh, and there is a lot of it.

Technical Challenges Storage capacity Storage throughput Pipeline throughput Processing power Parallel processing System Integration Data Analysis Scalable storage Massive Parallel Processing Ready to use tools

Big Data Solutions Real-time transactions at very high scale, always available, distributed Analytics and batch-like workload on very large volume often unstructured Relaxing ACID rules Atomicity Consistency Isolation Durability Example: eventual consistency in Cassandra Massively scalable Throughput oriented Sacrifice efficiency for scale Hadoop is most industry accepted standard / tool

What is Hadoop?

Hadoop Principles Bring Code to Data Share Nothing

Hadoop in a Nutshell Replicated Distributed Big- Data File System Map-Reduce - framework for writing massively parallel jobs

HDFS architecture simplified view Files are split in large blocks Each block is replicated on write Files can be only created and deleted by one client Uploading new data? => new file Append supported in recent versions Update data? => recreate file No concurrent writes to a file Clients transfer blocks directly to & from data nodes Data nodes use cheap local disks Local reads are efficient

HDFS design principles

Map Reduce example histogram calculation

Map Reduce pros & cons Advantages Very simple Flexible Highly scalable Good fit for HDFS mappers read locally Fault tolerant Pitfalls Low efficiency Lots of intermediate data Lots of network traffic on shuffle Complex manipulation requires pipeline of multiple jobs No high-level language Only mappers leverage local reads on HDFS

Main components of Hadoop ecosystem Hive HiveQL is SQL like query language Generates MapReduce jobs Pig data sets manipulation language (like create your own query execution plan) Generates MapReduce jobs Zookeeper distributed cluster manager Oozie workflow scheduler services Sqoop transfer data between Hadoop and relational

Non-MR processing on Hadoop HBase columnar-oriented key-value store (NoSQL) SQL without Map Reduce Impala (Cloudera) Drill (MapR) Phoenix (Salesforce.com) Hadapt (commercial) Shark Spark in-memory analytics on Hadoop

Hadoop Benefits Reliable solution based on unreliable hardware Designed for large files Load data first, structure later Designed to maximize throughput of large scans Designed to leverage parallelism Designed to scale Flexible development platform Solution Ecosystem

Hadoop Limitations Hadoop is scalable but not fast Some assembly required Batteries not included Instrumentation not included either DIY mindset (remember MySQL?)

How much does it cost? $300K DIY on SuperMicro 100 data nodes 2 name nodes 3 racks 800 Sandy Bridge CPU cores 6.4 TB RAM 600 x 2TB disks 1.2 PB of raw disk capacity 400 TB usable (triple mirror) Open-source s/w

Hadoop Use Cases

Use Cases for Big Data Top-line contributions Analyze customer behavior Optimize ad placements Customized promotions and etc Recommendation systems Netflix, Pandora, Amazon Improve connection with your customers Know your customers patterns and responses Bottom-line contributors Cheap archives storage ETL layer transformation engine, data cleansing

Typical Initial Use-Cases for Hadoop in modern Enterprise IT Transformation engine (part of ETL) Scales easily Inexpensive processing capacity Any data source and destination Data Landfill Stop throwing away any data Don t know how to use data today? Maybe tomorrow you will Hadoop is very inexpensive but very reliable

Advanced: Data Science Platform Data warehouse is good when questions are known, data domain and structure is defined Hadoop is great for seeking new meaning of data, new types of insights Unique information parsing and interpretation Huge variety of data sources and domains When new insights are found and new structure defined, Hadoop often takes place of ETL engine Newly structured information is then loaded to more traditional datawarehouses (still today)

Pythian Internal Hadoop Use OCR of screen video capture from Pythian privileged access surveillance system Input raw frames from video capture Map-Reduce job runs OCR on frames and produces text Map-Reduce job identifies text changes from frame to frame and produces text stream with timestamp when it was on the screen Other Map-Reduce jobs mine text (and keystrokes) for insights Credit Cart patterns Sensitive commands (like DROP TABLE) Root access Unusual activity patterns Merge with monitoring and documentation systems

Hadoop in the Data Warehouse Use Cases and Customer Stories

ETL for Unstructured Data Logs Web servers, app server, clickstreams Flume Hadoop Cleanup, aggregation Longterm storage DWH BI, batch reports

ETL for Structured Data OLTP Oracle, MySQL, Informix Sqoop, Perl Hadoop Transformation aggregation Longterm storage DWH BI, batch reports

Bring the World into Your Datacenter

Rare Historical Report

Find Needle in Haystack

Hadoop for Oracle DBAs? alert.log repository listener.log repository Statspack/AWR/ASH repository trace repository DB Audit repository Web logs repository SAR repository SQL and execution plans repository Database jobs execution logs

Connecting the (big) Dots

Sqoop Queries

Sqoop is Flexible Import Select <columns> from <table> where <condition> Or <write your own query> Split column Parallel Incremental File formats

Sqoop Import Examples Sqoop import - - connect jdbc:oracle:thin:@// dbserver:1521/masterdb - - username hr - - table emp - - where start_date > 01-01- 2012 Sqoop import jdbc:oracle:thin:@//dbserver:1521/ masterdb - - username myuser - - table shops - - split- by shop_id - - num- mappers 16 Must be indexed or partitioned to avoid 16 full table scans

Less Flexible Export 100 row batch inserts Commit every 100 batches Parallel export Merge vs. Insert Example: sqoop export - - connect jdbc:mysql://db.example.com/foo - - table bar - - export- dir /results/bar_data

FUSE-DFS Mount HDFS on Oracle server: sudo yum install hadoop-0.20-fuse hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point> Use external tables to load data into Oracle File Formats may vary All ETL best practices apply

Oracle Loader for Hadoop Load data from Hadoop into Oracle Map-Reduce job inside Hadoop Converts data types, partitions and sorts Direct path loads Reduces CPU utilization on database NEW: Support for Avro Support for compression codecs

Oracle Direct Connector to HDFS Create external tables of files in HDFS PREPROCESSOR HDFS_BIN_PATH:hdfs_stream All the features of External Tables Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS

Oracle SQL Connector for HDFS Map-Reduce Java program Creates an external table Can use Hive Metastore for schema Optimized for parallel queries Supports Avro and compression

How not to Fail

Data That Belong in RDBMS

Prepare for Migration

Use Hadoop Efficiently Understand your bottlenecks: CPU, storage or network? Reduce use of temporary data: All data is over the network Written to disk in triplicate. Eliminate unbalanced workloads Offload work to RDBMS Fine-tune optimization with Map-Reduce

Your Data is NOT as BIG as you think

Getting started Pick a business problem Acquire data Get the tools: Hadoop, R, Hive, Pig, Tableau Get platform: can start cheap Analyze data Need Data Analysts a.k.a. Data Scientists Pick an operational problem Data store ETL Get the tools: Hadoop, Sqoop, Hive, Pig, Oracle Connectors Get platform: Ops suitable Operational team

Continue Your Education Communities Knowledge Saring Education www.collaborate13.ioug.org

Thank you & Q&A To contact us sales@pythian.com 1-866-PYTHIAN To follow us http://www.pythian.com/news/ http://www.facebook.com/pages/the-pythian-group/ http://twitter.com/pythian http://www.linkedin.com/company/pythian 5