Microsoft SQL Server Connector for Apache Hadoop Version 1.0. User Guide




Microsoft SQL Server Connector for Apache Hadoop Version 1.0 User Guide October 3, 2011

Contents

Legal Notice
Introduction
    What is SQL Server-Hadoop Connector?
    What is Sqoop?
    Supported File Types
Before You Install SQL Server-Hadoop Connector
    Requirements
    Step 1: Install and Configure Cloudera's Distribution Including Hadoop
    Step 2: Install and Configure Sqoop
    Step 3: Download and Install the Microsoft JDBC Driver
Download and Install SQL Server-Hadoop Connector
Example Import Commands
    Example 1: Import to delimited text files on HDFS
    Example 2: Import with the split-by option
    Example 3: Import to SequenceFiles on HDFS
    Example 4: Import to tables in Hive
Example Export Commands
    Example 1: Export data from a delimited text file on HDFS
    Example 2: Export data from a delimited text file or SequenceFile on HDFS with a user-defined number of mappers
    Example 3: Export data from a delimited text file or SequenceFile on HDFS using a staging table
Data Types
Known Issues
Troubleshooting and Support
Security Notes

Legal Notice

This document is provided "as-is". Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred. This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

Copyright 2011 Microsoft Corporation. Some information relates to pre-released product which may be substantially modified before it is commercially released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Introduction

What is SQL Server-Hadoop Connector?

Microsoft SQL Server Connector for Apache Hadoop (SQL Server-Hadoop Connector) is a Sqoop-based connector that facilitates efficient data transfer between SQL Server 2008 R2 and Hadoop. Sqoop supports several databases; this connector extends JDBC-based Sqoop connectivity to transfer data between SQL Server and Hadoop, and supports the JDBC features described in the Sqoop User Guide on the Cloudera website. In addition, this connector provides support for the nchar and nvarchar data types.

With SQL Server-Hadoop Connector, you can import data from:
- tables in SQL Server to delimited text files on HDFS
- tables in SQL Server to SequenceFiles on HDFS
- tables in SQL Server to tables in Hive*
- results of queries executed on SQL Server to delimited text files on HDFS
- results of queries executed on SQL Server to SequenceFiles on HDFS
- results of queries executed on SQL Server to tables in Hive*

Note: importing data from SQL Server into HBase is not supported in this release.

With SQL Server-Hadoop Connector, you can export data from:
- delimited text files on HDFS to tables in SQL Server
- SequenceFiles on HDFS to tables in SQL Server
- Hive tables* to tables in SQL Server

* Hive is a data warehouse infrastructure built on top of Hadoop (http://wiki.apache.org/hadoop/hive). We recommend using the hive-0.7.0-cdh3u0 version of Cloudera Hive.

What is Sqoop?

Sqoop is an open source connectivity framework that facilitates data transfer between multiple Relational Database Management Systems (RDBMS) and HDFS. Sqoop uses MapReduce programs to import and export data; the imports and exports are performed in parallel with fault tolerance.

Supported File Types

The source / target files used by Sqoop can be delimited text files (for example, with commas or tabs separating each field) or binary SequenceFiles containing serialized record data. Refer to section 7.2.7 of the Sqoop User Guide for more details on supported file types.
For information on SequenceFile format, please refer to the Hadoop API page.

Before You Install SQL Server-Hadoop Connector

The following requirements and steps explain how to prepare your system before installing SQL Server-Hadoop Connector.

Requirements

This User Guide assumes your environment has both Linux (for the Hadoop setup) and Windows (with SQL Server installed). Both are required to use the SQL Server-Hadoop Connector.

Step 1: Install and Configure Cloudera's Distribution Including Hadoop

The first installation step is to install and configure Cloudera's Distribution Including Hadoop Update 1 (CDH3U1) on Linux. This is available for download from the Cloudera site at www.cloudera.com/downloads. Cloudera's CDH3U0 distribution of Hadoop is also supported for this connector, but we recommend CDH3U1. Set the HADOOP_HOME environment variable to the parent directory where Hadoop is installed.

Step 2: Install and Configure Sqoop

The next step is to install and configure Sqoop, if it is not already installed, on the master node of the Hadoop cluster. We recommend downloading and installing Sqoop 1.3.0-cdh3u1 (sqoop-1.3.0-cdh3u1.tar.gz) from http://archive.cloudera.com/cdh/3/. For detailed instructions about using Sqoop, see the Sqoop User Guide at http://archive.cloudera.com/cdh/3/sqoop-1.3.0-cdh3u1/sqoopuserguide.html. SQL Server-Hadoop Connector is backward compatible with Sqoop 1.2.0, but we recommend using Sqoop 1.3.0.

After installing and configuring Sqoop, verify that the following environment variables are set on the machine with the Sqoop installation. These must be set for SQL Server-Hadoop Connector to work correctly.

Environment Variable    Value to Assign
SQOOP_HOME              Absolute path to the Sqoop installation directory
SQOOP_CONF_DIR          $SQOOP_HOME/conf
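The environment variable setup above can be sketched as a small shell session. The install paths used here are illustrative assumptions, not defaults mandated by this guide; substitute your actual directories.

```shell
#!/bin/sh
# Sketch: export and verify the environment variables the connector relies on.
# /usr/lib/hadoop and /usr/lib/sqoop are hypothetical example paths.
export HADOOP_HOME=/usr/lib/hadoop        # parent directory of the Hadoop install
export SQOOP_HOME=/usr/lib/sqoop          # Sqoop installation directory
export SQOOP_CONF_DIR="$SQOOP_HOME/conf"

# Fail early if any required variable is empty.
for v in HADOOP_HOME SQOOP_HOME SQOOP_CONF_DIR; do
  val=$(eval echo "\$$v")
  if [ -z "$val" ]; then
    echo "ERROR: $v is not set" >&2
    exit 1
  fi
  echo "$v=$val"
done
```

Putting these exports in the shell profile of the user who runs Sqoop avoids having to set them in every session.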
Step 3: Download and Install the Microsoft JDBC Driver

Sqoop and SQL Server-Hadoop Connector use JDBC technology to establish connections to remote RDBMS servers and therefore need the JDBC driver for SQL Server.

To install this driver on the Linux node where Sqoop is already installed:
- Visit http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=21599 and download sqljdbc_<version>_enu.tar.gz.
- Copy it to the machine with the Sqoop installation.
- Unpack the tar file with the command: tar zxvf sqljdbc_<version>_enu.tar.gz. This creates a directory sqljdbc_3.0 in the current directory.

- Copy the driver jar file (sqljdbc_3.0/enu/sqljdbc4.jar) to the $SQOOP_HOME/lib directory on the machine with the Sqoop installation.

Download and Install SQL Server-Hadoop Connector

After all of the previous steps have been completed, you are ready to download, install, and configure the SQL Server-Hadoop Connector on the machine with the Sqoop installation. The SQL Server-Hadoop Connector is distributed as a compressed tar archive named sqoop-sqlserver-1.0.tar.gz. Download the tar archive from http://download.microsoft.com and save it on the same machine where Sqoop is installed. This archive is composed of the following files and directories:

install.sh - A shell script that installs the SQL Server-Hadoop Connector files into the Sqoop directory structure.
Microsoft SQL Server-Hadoop Connector User Guide.pdf - Contains instructions to deploy and execute SQL Server-Hadoop Connector.
lib/ - Contains the sqoop-sqlserver-1.0.jar file.
conf/ - Contains the configuration files for SQL Server-Hadoop Connector.
THIRDPARTYNOTICES FOR HADOOP-BASED CONNECTORS.txt - Contains the third-party notices.
SQL Server Connector for Apache Hadoop MSLT.pdf - EULA for the SQL Server Connector for Apache Hadoop.

To install SQL Server-Hadoop Connector:
1. Log in to the machine where Sqoop is installed as a user who has permission to install files.
2. Extract the archive with the command: tar zxvf sqoop-sqlserver-1.0.tar.gz. This creates a sqoop-sqlserver-1.0 directory in the current directory.
3. Change directory (cd) to sqoop-sqlserver-1.0.
4. Ensure that the MSSQL_CONNECTOR_HOME environment variable is set to the absolute path of the sqoop-sqlserver-1.0 directory.
5. Run the shell script install.sh with no additional arguments.
6. The installer copies the connector jar and configuration file under the existing Sqoop installation.
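A quick sanity check after running install.sh can confirm that the connector jar landed in the Sqoop tree. This is only a sketch: the fallback path /usr/lib/sqoop is a hypothetical default, and the jar name is taken from the archive contents listed above.

```shell
#!/bin/sh
# Sketch: verify that install.sh placed the connector jar under $SQOOP_HOME.
# /usr/lib/sqoop below is a hypothetical fallback; adjust to your install.
SQOOP_HOME=${SQOOP_HOME:-/usr/lib/sqoop}
JAR="$SQOOP_HOME/lib/sqoop-sqlserver-1.0.jar"
if [ -f "$JAR" ]; then
  RESULT="found"
else
  RESULT="missing"
fi
echo "connector jar $RESULT: $JAR"
```

If the jar is reported missing, re-check step 4 (MSSQL_CONNECTOR_HOME) and re-run install.sh.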

Example Import Commands

You're now ready to use SQL Server-Hadoop Connector. The following examples import data from SQL Server to HDFS or Hive. They assume that you are running the commands from the $SQOOP_HOME directory on the master node of the Hadoop cluster, where Sqoop is installed.

Example 1: Import to delimited text files on HDFS

The following command imports data from the TPCH lineitem table in SQL Server to delimited text files in the /data/lineitemdata directory on HDFS:

$ bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --target-dir /data/lineitemdata

Example 2: Import with the split-by option

The following command specifies the split-by column used to compute the splits for the mappers:

$ bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --target-dir /data/lineitemdata --split-by L_ORDERKEY -m 3

Example 3: Import to SequenceFiles on HDFS

The following command imports data into SequenceFiles on HDFS:

$ bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --target-dir /data/lineitemdata --as-sequencefile

Example 4: Import to tables in Hive

The following command imports data from the lineitem table in SQL Server to a table in Hive:

$ bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --hive-import

To use Hive import, ensure that Hive is installed and HIVE_HOME is set to the parent directory where Hive is installed.
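The introduction notes that the results of SQL Server queries can also be imported, although the guide gives no example of it. As a hedged sketch following the free-form query syntax in the Sqoop 1.3 User Guide: a query import must contain the literal token $CONDITIONS (which Sqoop replaces with per-mapper range predicates) and needs a --split-by column when more than one mapper runs. The query, split column, and target directory below are illustrative; the command is composed into a variable for inspection, not executed.

```shell
#!/bin/sh
# Sketch: compose (but do not execute) a query-based import command.
# Host and credentials are the illustrative values used elsewhere in
# this guide; verify the flags against your Sqoop version.
CONN="jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch"
# Single quotes keep $CONDITIONS literal so Sqoop can substitute it later.
QUERY='SELECT L_ORDERKEY, L_QUANTITY FROM lineitem WHERE $CONDITIONS'
CMD="bin/sqoop import --connect '$CONN' --query \"$QUERY\" --split-by L_ORDERKEY --target-dir /data/lineitemquery"
echo "$CMD"
```

Inspect the printed command, then run it from $SQOOP_HOME as with the table-based examples above.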

Example Export Commands

The following examples export data from HDFS or Hive to SQL Server. They assume that you are running the commands from the $SQOOP_HOME directory on the master node of the Hadoop cluster, where Sqoop is installed.

Example 1: Export data from a delimited text file on HDFS

The following command exports data from the delimited text file /data/lineitemdata on HDFS to the lineitem table in the tpch database on SQL Server:

$ bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --export-dir /data/lineitemdata

Example 2: Export data from a delimited text file or SequenceFile on HDFS with a user-defined number of mappers

The following command exports data from a delimited text file on HDFS with a user-defined number of mappers:

$ bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --export-dir /data/lineitemdata -m 3

The following command exports data from a SequenceFile on HDFS. The --jar-file <ORM_JAR_FILE> and --class-name <ORM_ClassName> parameters specify the jar file and the class that needs to be loaded from it. For more details on these options, see the Sqoop User Guide.

$ bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --export-dir /data/lineitemdata -m 3 --class-name <ORM_ClassName> --jar-file <ORM_JAR_FILE>

Example 3: Export data from a delimited text file or SequenceFile on HDFS using a staging table

The following command uses a staging table and clears the staging table before starting the export.
$ bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --export-dir /data/lineitemdata --staging-table lineitem_stage --clear-staging-table

Note: In the current release, using the --direct option with the Sqoop import / export tools makes no difference to the execution of the import / export flow.
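When the HDFS files use a non-default field delimiter, the export must be told how to parse them. The sketch below uses the --input-fields-terminated-by flag from the Sqoop 1.3 User Guide; the pipe delimiter and paths are illustrative, and the command is composed into a string for inspection rather than executed.

```shell
#!/bin/sh
# Sketch: compose an export command for pipe-delimited input files.
# --input-fields-terminated-by tells Sqoop how to split each record;
# the delimiter, table, and directory here are illustrative values.
CONN="jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch"
CMD="bin/sqoop export --connect '$CONN' --table lineitem --export-dir /data/lineitemdata --input-fields-terminated-by '|' -m 3"
echo "$CMD"
```

The matching --fields-terminated-by flag on the import side produces files in the same layout, so a round trip uses the same delimiter in both directions.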

Data Types

The following table summarizes the data types supported by this version of the SQL Server-Hadoop Connector. All other SQL Server types (e.g., xml, geography, geometry, sql_variant) not mentioned below are not supported at this time.

Exact numeric

bigint: -2^63 to 2^63-1
    maps to Long (MAX_VALUE: 2^63-1, i.e. 9223372036854775807; MIN_VALUE: -2^63, i.e. -9223372036854775808)
bit: 0 or 1
    maps to Boolean (1 bit)
decimal: -10^38+1 to 10^38-1
    maps to java.math.BigDecimal (no range specification found; non-lossy)
int: -2^31 to 2^31-1
    maps to Integer (MAX_VALUE: 2^31-1, i.e. 2147483647; MIN_VALUE: -2^31, i.e. -2147483648)
money: -922,337,203,685,477.5808 to 922,337,203,685,477.5807
    maps to java.math.BigDecimal (no range specification found; non-lossy)
smallint: -2^15 to 2^15-1
    maps to Integer (MAX_VALUE: 2^31-1; MIN_VALUE: -2^31)
smallmoney: -214,748.3648 to 214,748.3647
    maps to java.math.BigDecimal (no range specification found; non-lossy)
tinyint: 0 to 255
    maps to Integer (MAX_VALUE: 2^31-1; MIN_VALUE: -2^31)

Approximate numeric

float: -1.79E+308 to -2.23E-308, 0, and 2.23E-308 to 1.79E+308
    maps to Double (MAX_VALUE: (2-2^-52)*2^1023, or 1.7976931348623157E308; MIN_VALUE: 2^-1074, or 4.9E-324)
real: -3.40E+38 to -1.18E-38, 0, and 1.18E-38 to 3.40E+38 (the ISO synonym for real is float(24))
    maps to Float (MAX_VALUE: (2-2^-23)*2^127, or 3.4028234663852886E38; MIN_VALUE: 2^-149, or 1.401298464324817E-45)

Date and time

date: 0001-01-01 through 9999-12-31 (January 1, 1 A.D. through December 31, 9999 A.D.)
    maps to java.sql.Date (int year, int month, int date: year = the year minus 1900, 0 to 8099, since 8099 is 9999 minus 1900; month = 0 to 11; day = 1 to 31)

datetime2: date range 0001-01-01 through 9999-12-31 (January 1, 1 A.D. through December 31, 9999 A.D.); time range 00:00:00 through 23:59:59.9999999
    maps to java.sql.Timestamp (int year, int month, int date, int hour, int minute, int second, int nano: year = the year minus 1900; month = 0 to 11; date = 1 to 31; hour = 0 to 23; minute = 0 to 59; second = 0 to 59; nano = 0 to 999,999,999)
smalldatetime: date range 1900-01-01 through 2079-06-06 (January 1, 1900 through June 6, 2079); time range 00:00:00 through 23:59:59 (2007-05-09 23:59:59 rounds to 2007-05-10 00:00:00)
    maps to java.sql.Timestamp (fields as for datetime2)
datetime: date range January 1, 1753 through December 31, 9999; time range 00:00:00 through 23:59:59.997
    maps to java.sql.Timestamp (fields as for datetime2)
time: 00:00:00.0000000 through 23:59:59.9999999
    maps to java.sql.Time (int hour, int minute, int second: hour = 0 to 23; minute = 0 to 59; second = 0 to 59)

Character strings

char: fixed-length, non-Unicode character data with a length of n bytes; n must be a value from 1 through 8,000
    maps to String (up to 8,000 characters)
varchar: variable-length, non-Unicode character data; n can be a value from 1 through 8,000
    maps to String (up to 8,000 characters; varchar(max) is not supported)

Unicode character strings

nchar: fixed-length Unicode character data of n characters; n must be a value from 1 through 4,000
    maps to String (up to 4,000 Unicode characters)
nvarchar: variable-length Unicode character data; n can be a value from 1 through 4,000
    maps to String (up to 4,000 Unicode characters; nvarchar(max) is not supported)

Binary strings

binary: fixed-length binary data with a length of n bytes, where n is a value from 1 through 8,000
    maps to BytesWritable (up to 8,000 bytes)
varbinary: variable-length binary data; n can be a value from 1 through 8,000
    maps to BytesWritable (up to 8,000 bytes; varbinary(max) is not supported)
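Since types such as xml, geography, and sql_variant are unsupported, one possible workaround, offered here only as a hedged suggestion and not described in this guide, is to cast such a column to a supported type inside a query-based import. The table and column names below are hypothetical, and the command is composed into a string for inspection rather than executed.

```shell
#!/bin/sh
# Sketch: cast an unsupported column (a hypothetical xml column named
# "payload" in a hypothetical "events" table) to nvarchar so it arrives
# as a supported String type. nvarchar(max) is unsupported, so a bounded
# length is used.
CONN="jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch"
QUERY='SELECT id, CAST(payload AS nvarchar(4000)) AS payload_text FROM events WHERE $CONDITIONS'
CMD="bin/sqoop import --connect '$CONN' --query \"$QUERY\" --split-by id --target-dir /data/events"
echo "$CMD"
```

Values longer than the cast length are truncated by SQL Server, so this trade-off should be checked against the actual data before use.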

Known Issues

This JDBC-based connector is an extension of Sqoop, and the open issues in Sqoop also occur in this connector. For a detailed description of known Sqoop issues, see https://issues.apache.org/jira/browse/sqoop.

The --driver switch does not function correctly with the SQL Server connector; avoid it and use the --connect switch instead.

Troubleshooting and Support

This JDBC-based connector is an extension of Sqoop. For troubleshooting and support details with respect to Sqoop, see the Sqoop User Guide on the Cloudera site.

Security Notes

For secure communication between the Hadoop nodes, we recommend configuring IPsec or similar technologies; this helps prevent man-in-the-middle attacks. For one approach, see https://help.ubuntu.com/community/ipsechowto.

We recommend using the escaped-by and enclosed-by switches provided in Sqoop.

To ensure secure communication between the Hadoop nodes and SQL Server, use encrypt=true in the connection string. For details, see http://msdn.microsoft.com/en-us/library/bb879949.aspx on MSDN. This is recommended but has not been tested with the current release.
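As a sketch of the last recommendation, this is the connection string used throughout this guide with encryption requested. The encrypt and trustServerCertificate properties are standard Microsoft JDBC Driver connection options; whether to validate the server certificate (trustServerCertificate=false) depends on your environment, and, as noted above, encrypted connections were not tested with this release.

```shell
#!/bin/sh
# Sketch: the guide's example connection string with encryption enabled.
# trustServerCertificate=false forces certificate validation; adjust per
# your environment. Nothing is executed; the string is printed for reuse.
CONN="jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch;encrypt=true;trustServerCertificate=false"
echo "bin/sqoop import --connect '$CONN' --table lineitem --target-dir /data/lineitemdata"
```

The same CONN value can be substituted into any of the import or export examples in this guide.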