Apache Sqoop. A Data Transfer Tool for Hadoop




Apache Sqoop: A Data Transfer Tool for Hadoop
Arvind Prabhakar, Cloudera Inc.
September 21, 2011

What is Sqoop?
Allows easy import and export of data from structured data stores:
o Relational databases
o Enterprise data warehouses
o NoSQL datastores
Allows easy integration with Hadoop-based systems:
o Hive
o HBase
o Oozie

Agenda
o Motivation
o Importing and exporting data using Sqoop
o Provisioning the Hive metastore
o Populating HBase tables
o Sqoop connectors
o Current status

Motivation
o Structured data stored in databases and EDWs is not easily accessible for analysis in Hadoop.
o Access to databases and EDWs from Hadoop clusters is problematic.
o Forcing MapReduce to access data in databases/EDWs directly is repetitive, error-prone, and non-trivial.
o Data often requires preparation before it can be consumed efficiently by Hadoop-based data pipelines.
o Current methods of transferring data are inefficient and ad hoc.

Enter: Sqoop
A tool to automate data transfer between structured datastores and Hadoop.
Highlights:
o Uses datastore metadata to infer structure definitions
o Uses the MapReduce framework to transfer data in parallel
o Allows structure definitions to be provisioned in the Hive metastore
o Provides an extension mechanism to incorporate high-performance connectors for external systems
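As a quick orientation before the examples that follow: Sqoop exposes each operation as a command-line tool. The lines below are a minimal sketch of the general command shape (the angle-bracket values are placeholders, not real arguments); `sqoop help` lists the available tools and `sqoop help import` shows the options for one of them.

$ sqoop help                     # lists the available tools (import, export, ...)
$ sqoop help import              # shows the options accepted by a specific tool
$ sqoop <tool> --connect <jdbc-uri> --username <user> [tool-specific options]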

Importing Data

mysql> describe ORDERS;
+-----------------+-------------+------+-----+---------+-------+
| Field           | Type        | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+-------+
| ORDER_NUMBER    | int(11)     | NO   | PRI | NULL    |       |
| ORDER_DATE      | datetime    | NO   |     | NULL    |       |
| REQUIRED_DATE   | datetime    | NO   |     | NULL    |       |
| SHIP_DATE       | datetime    | YES  |     | NULL    |       |
| STATUS          | varchar(15) | NO   |     | NULL    |       |
| COMMENTS        | text        | YES  |     | NULL    |       |
| CUSTOMER_NUMBER | int(11)     | NO   |     | NULL    |       |
+-----------------+-------------+------+-----+---------+-------+
7 rows in set (0.00 sec)

Importing Data

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password ****

INFO mapred.JobClient: Counters: 12
INFO mapred.JobClient:   Job Counters
INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12873
INFO mapred.JobClient:     Launched map tasks=4
INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
INFO mapred.JobClient:   FileSystemCounters
INFO mapred.JobClient:     HDFS_BYTES_READ=505
INFO mapred.JobClient:     FILE_BYTES_WRITTEN=222848
INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=35098
INFO mapred.JobClient:   Map-Reduce Framework
INFO mapred.JobClient:     Map input records=326
INFO mapred.JobClient:     Spilled Records=0
INFO mapred.JobClient:     Map output records=326
INFO mapred.JobClient:     SPLIT_RAW_BYTES=505
INFO mapreduce.ImportJobBase: Transferred 34.2754 KB in 11.2754 seconds (3.0398 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 326 records.
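The import above relies on defaults for the target directory and the degree of parallelism. The following is a sketch of commonly used knobs: --target-dir picks the HDFS output directory, --split-by and --num-mappers control how the work is partitioned across map tasks, --where restricts the rows imported, and -P prompts for the password instead of passing it on the command line. The directory name and WHERE clause are illustrative, not from the talk.

# Sketch: control where the data lands and how many map tasks run in parallel.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test -P \
    --target-dir /user/arvind/orders_2003 \
    --split-by ORDER_NUMBER \
    --num-mappers 8 \
    --where "ORDER_DATE >= '2003-01-01'"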

Importing Data

$ hadoop fs -ls
Found 32 items
drwxr-xr-x   - arvind staff          0 2011-09-13 19:12 /user/arvind/orders

$ hadoop fs -ls /user/arvind/orders
Found 6 items
       0 2011-09-13 19:12 /user/arvind/orders/_SUCCESS
       0 2011-09-13 19:12 /user/arvind/orders/_logs
    8826 2011-09-13 19:12 /user/arvind/orders/part-m-00000
    8760 2011-09-13 19:12 /user/arvind/orders/part-m-00001
    8841 2011-09-13 19:12 /user/arvind/orders/part-m-00002
    8671 2011-09-13 19:12 /user/arvind/orders/part-m-00003
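To sanity-check the result, you can look at one of the part files directly; by default each part-m-* file holds plain text records with comma-separated fields. This is a sketch, and the pipe through head is just to limit the output.

$ hadoop fs -cat /user/arvind/orders/part-m-00000 | head -n 3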

Exporting Data

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS_CLEAN --username test --password **** \
    --export-dir /user/arvind/orders

INFO mapreduce.ExportJobBase: Transferred 34.7178 KB in 6.7482 seconds (5.1447 KB/sec)
INFO mapreduce.ExportJobBase: Exported 326 records.
$

o Default delimiters: ',' for fields, newlines for records
o An escape sequence can optionally be specified
o Delimiters can be specified for both import and export
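As a sketch of how delimiters might be overridden, the pair of commands below imports with tab-separated fields and a backslash escape character, then tells the export side how to parse that data back. The choice of tab and backslash here is illustrative, not from the talk.

# Import with tab-separated fields and a backslash escape character.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test -P \
    --fields-terminated-by '\t' --escaped-by '\\'

# Export data that was written with those delimiters.
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS_CLEAN --username test -P \
    --export-dir /user/arvind/ORDERS \
    --input-fields-terminated-by '\t' --input-escaped-by '\\'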

Exporting Data
Exports can optionally use staging tables:
o Map tasks populate the staging table
o Each map write is broken down into many transactions
o The staging table is then used to populate the target table in a single transaction
o In case of failure, the staging table provides insulation from data corruption
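A sketch of what that might look like on the command line: the ORDERS_STAGE table name is illustrative, the staging table is assumed to already exist with the same schema as the target, and --clear-staging-table empties it before the export begins.

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS_CLEAN --username test -P \
    --export-dir /user/arvind/orders \
    --staging-table ORDERS_STAGE \
    --clear-staging-table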

Importing Data into Hive

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** --hive-import

INFO mapred.JobClient: Counters: 12
INFO mapreduce.ImportJobBase: Transferred 34.2754 KB in 11.3995 seconds (3.0068 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 326 records.
INFO hive.HiveImport: Removing temporary files from import process: ORDERS/_logs
INFO hive.HiveImport: Loading uploaded data into Hive
WARN hive.TableDefWriter: Column ORDER_DATE had to be cast to a less precise type in Hive
WARN hive.TableDefWriter: Column REQUIRED_DATE had to be cast to a less precise type in Hive
WARN hive.TableDefWriter: Column SHIP_DATE had to be cast to a less precise type in Hive
$
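A sketch of related options that might be useful here (the acme_orders table name is illustrative): --hive-table controls the name of the target Hive table, --hive-overwrite replaces any existing data in it, and --map-column-hive makes the column type mapping explicit rather than relying on the inferred one that triggers the warnings above.

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test -P --hive-import \
    --hive-table acme_orders \
    --hive-overwrite \
    --map-column-hive ORDER_DATE=STRING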

Importing Data into Hive

$ hive
hive> show tables;
OK
orders
hive> describe orders;
OK
order_number     int
order_date       string
required_date    string
ship_date        string
status           string
comments         string
customer_number  int
Time taken: 0.236 seconds
hive>
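Once the table is registered in the metastore it can be queried like any other Hive table; a minimal sketch (the particular query is illustrative):

hive> SELECT status, count(*) FROM orders GROUP BY status;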

Importing Data into HBase

$ bin/sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --hbase-create-table --hbase-table ORDERS --column-family mysql

INFO mapreduce.HBaseImportJob: Creating missing HBase table ORDERS
INFO mapreduce.ImportJobBase: Retrieved 326 records.
$

o Sqoop creates the missing table if instructed
o If no row key is specified, the primary key column is used
o Each output column is placed in the same column family
o Every record read results in an HBase put operation
o All values are converted to their string representation and inserted as UTF-8 bytes
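A sketch of choosing the row key explicitly instead of relying on the primary key default; since ORDER_NUMBER is the table's primary key, this is equivalent to the behavior above.

$ bin/sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test -P \
    --hbase-create-table --hbase-table ORDERS --column-family mysql \
    --hbase-row-key ORDER_NUMBER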

Importing Data into HBase

hbase(main):001:0> list
TABLE
ORDERS
1 row(s) in 0.3650 seconds

hbase(main):002:0> describe 'ORDERS'
DESCRIPTION                                                 ENABLED
 {NAME => 'ORDERS', FAMILIES => [{NAME => 'mysql',          true
 BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
 COMPRESSION => 'NONE', VERSIONS => '3',
 TTL => '2147483647', BLOCKSIZE => '65536',
 IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0310 seconds

hbase(main):003:0>

Importing Data into HBase

hbase(main):001:0> scan 'ORDERS', { LIMIT => 1 }
ROW      COLUMN+CELL
 10100   column=mysql:customer_number, timestamp=1316036948264, value=363
 10100   column=mysql:order_date, timestamp=1316036948264, value=2003-01-06 00:00:00.0
 10100   column=mysql:required_date, timestamp=1316036948264, value=2003-01-13 00:00:00.0
 10100   column=mysql:ship_date, timestamp=1316036948264, value=2003-01-10 00:00:00.0
 10100   column=mysql:status, timestamp=1316036948264, value=shipped
1 row(s) in 0.0130 seconds

hbase(main):012:0>
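To fetch a single imported row by its key rather than scanning, a get works as well; a minimal sketch using the row shown above:

hbase(main):002:0> get 'ORDERS', '10100'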

Sqoop Connectors
The connector mechanism allows creation of new connectors that improve or augment Sqoop functionality.
Bundled connectors include:
o MySQL, PostgreSQL, Oracle, SQL Server, generic JDBC
o Direct MySQL, Direct PostgreSQL
Regular connectors are JDBC-based. Direct connectors use native tools for high-performance data transfer.
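When no bundled connector matches a database, the generic JDBC path can be selected explicitly by naming the driver class; a sketch, where the connection string and driver class are placeholders rather than a real product.

# Sketch: fall back to the generic JDBC connector for an arbitrary database.
$ sqoop import --connect jdbc:somedb://dbhost/acmedb \
    --driver com.example.somedb.Driver \
    --table ORDERS --username test -P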

Import Using the Direct MySQL Connector

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** --direct

manager.DirectMySQLManager: Beginning mysqldump fast path import

Direct import works as follows:
o Data is partitioned into splits using JDBC
o Map tasks use mysqldump to do the import, with a conditional selection clause (-w 'ORDER_NUMBER > ')
o Header and footer information is stripped out
Direct export similarly uses the mysqlimport utility.
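As a sketch, the same flag applies on the export side, where the direct path hands the work to mysqlimport:

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS_CLEAN --username test -P \
    --export-dir /user/arvind/orders --direct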

Third-Party Connectors
o Oracle - developed by Quest Software
o Couchbase - developed by Couchbase
o Netezza - developed by Cloudera
o Teradata - developed by Cloudera
o Microsoft SQL Server - developed by Microsoft
o Microsoft PDW - developed by Microsoft
o VoltDB - developed by VoltDB

Current Status
Sqoop is currently in the Apache Incubator.
Status page: http://incubator.apache.org/projects/sqoop.html
Mailing lists:
o sqoop-user@incubator.apache.org
o sqoop-dev@incubator.apache.org
Release: the current shipping version is 1.3.0

Sqoop Meetup
Monday, November 7, 2011, 8pm - 9pm, at the Sheraton New York Hotel & Towers, NYC

Thank you! Q & A