Hadoop Integration Guide




HP Vertica Analytic Database
Software Version: 7.1.x
Document Release Date: 12/9/2015

Legal Notices

Warranty
The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. The information contained herein is subject to change without notice.

Restricted Rights Legend
Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license.

Copyright Notice
Copyright 2006-2015 Hewlett-Packard Development Company, L.P.

Trademark Notices
Adobe is a trademark of Adobe Systems Incorporated.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.

Contents

How HP Vertica and Apache Hadoop Work Together
HP Vertica's Integrations for Apache Hadoop
Cluster Layout
Choosing Which Hadoop Interface to Use
Using the HP Vertica Connector for Hadoop MapReduce
HP Vertica Connector for Hadoop Features
Prerequisites
Hadoop and HP Vertica Cluster Scaling
Installing the Connector
Accessing HP Vertica Data From Hadoop
Selecting VerticaInputFormat
Setting the Query to Retrieve Data From HP Vertica
Using a Simple Query to Extract Data From HP Vertica
Using a Parameterized Query and Parameter Lists
Using a Discrete List of Values
Using a Collection Object
Scaling Parameter Lists for the Hadoop Cluster
Using a Query to Retrieve Parameter Values for a Parameterized Query
Writing a Map Class That Processes HP Vertica Data
Working with the VerticaRecord Class
Writing Data to HP Vertica From Hadoop
Configuring Hadoop to Output to HP Vertica
Defining the Output Table
Writing the Reduce Class
Storing Data in the VerticaRecord
Passing Parameters to the HP Vertica Connector for Hadoop Map Reduce At Run Time
Specifying the Location of the Connector .jar File
Specifying the Database Connection Parameters
Parameters for a Separate Output Database
Example HP Vertica Connector for Hadoop Map Reduce Application
Compiling and Running the Example Application
Compiling the Example (optional)
Running the Example Application
Verifying the Results
Using Hadoop Streaming with the HP Vertica Connector for Hadoop Map Reduce
Reading Data From HP Vertica in a Streaming Hadoop Job
Writing Data to HP Vertica in a Streaming Hadoop Job
Loading a Text File From HDFS into HP Vertica
Accessing HP Vertica From Pig
Registering the HP Vertica .jar Files
Reading Data From HP Vertica
Writing Data to HP Vertica
Using the HP Vertica Connector for HDFS
HP Vertica Connector for HDFS Requirements
Kerberos Authentication Requirements
Testing Your Hadoop WebHDFS Configuration
Installing the HP Vertica Connector for HDFS
Downloading and Installing the HP Vertica Connector for HDFS Package
Loading the HDFS User Defined Source
Loading Data Using the HP Vertica Connector for HDFS
The HDFS File URL
Copying Files in Parallel
Viewing Rejected Rows and Exceptions
Creating an External Table Based on HDFS Files
Load Errors in External Tables
Installing and Configuring Kerberos on Your HP Vertica Cluster
Installing and Configuring Kerberos on Your HP Vertica Cluster
Preparing Keytab Files for the HP Vertica Connector for HDFS
HP Vertica Connector for HDFS Troubleshooting Tips
User Unable to Connect to Kerberos-Authenticated Hadoop Cluster
Resolving Error 5118
Transfer Rate Errors
Using the HCatalog Connector
Hive, HCatalog, and WebHCat Overview
HCatalog Connection Features
HCatalog Connection Considerations
Reading ORC Files Directly
Syntax
Supported Data Types
Kerberos
Query Performance
Examples
How the HCatalog Connector Works
HCatalog Connector Requirements
HP Vertica Requirements
Hadoop Requirements
Testing Connectivity
Installing the Java Runtime on Your HP Vertica Cluster
Installing a Java Runtime
Setting the JavaBinaryForUDx Configuration Parameter
Configuring HP Vertica for HCatalog
Copy Hadoop JARs and Configuration Files
Install the HCatalog Connector
Using the HCatalog Connector with HA NameNode
Defining a Schema Using the HCatalog Connector
Querying Hive Tables Using HCatalog Connector
Viewing Hive Schema and Table Metadata
Synching an HCatalog Schema With a Local Schema
Data Type Conversions from Hive to HP Vertica
Data-Width Handling Differences Between Hive and HP Vertica
Using Non-Standard SerDes
Determining Which SerDe You Need
Installing the SerDe on the HP Vertica Cluster
Troubleshooting HCatalog Connector Problems
Connection Errors
UDx Failure When Querying Data: Error 3399
SerDe Errors
Differing Results Between Hive and HP Vertica Queries
Preventing Excessive Query Delays
Using the HP Vertica Storage Location for HDFS
Storage Location for HDFS Requirements
HDFS Space Requirements
Additional Requirements for Backing Up Data Stored on HDFS
How the HDFS Storage Location Stores Data
What You Can Store on HDFS
What HDFS Storage Locations Cannot Do
Creating an HDFS Storage Location
Creating a Storage Location Using HP Vertica for SQL on Hadoop
Adding HDFS Storage Locations to New Nodes
Creating a Storage Policy for HDFS Storage Locations
Storing an Entire Table in an HDFS Storage Location
Storing Table Partitions in HDFS
Moving Partitions to a Table Stored on HDFS
Backing Up HP Vertica Storage Locations for HDFS
Configuring HP Vertica to Restore HDFS Storage Locations
Configuration Overview
Installing a Java Runtime
Finding Your Hadoop Distribution's Package Repository
Configuring HP Vertica Nodes to Access the Hadoop Distribution's Package Repository
Installing the Required Hadoop Packages
Setting Configuration Parameters
Setting Kerberos Parameters
Confirming that distcp Runs
Troubleshooting
Configuring Hadoop and HP Vertica to Enable Backup of HDFS Storage
Granting Superuser Status on Hortonworks 2.1
Granting Superuser Status on Cloudera 5.1
Manually Enabling Snapshotting for a Directory
Additional Requirements for Kerberos
Testing the Database Account's Ability to Make HDFS Directories Snapshottable
Performing Backups Containing HDFS Storage Locations
Removing HDFS Storage Locations
Removing Existing Data from an HDFS Storage Location
Moving Data to Another Storage Location
Clearing Storage Policies
Changing the Usage of HDFS Storage Locations
Dropping an HDFS Storage Location
Removing Storage Location Files from HDFS
Removing Backup Snapshots
Removing the Storage Location Directories
Troubleshooting HDFS Storage Locations
HDFS Storage Disk Consumption
Kerberos Authentication When Creating a Storage Location
Backup or Restore Fails When Using Kerberos
Integrating HP Vertica with the MapR Distribution of Hadoop
Using Kerberos with Hadoop
How Vertica uses Kerberos With Hadoop
User Authentication
HP Vertica Authentication
See Also
Configuring Kerberos
Prerequisite: Setting Up Users and the Keytab File
HCatalog Connector
HDFS Connector
HDFS Storage Location
Token Expiration
See Also
We appreciate your feedback!

How HP Vertica and Apache Hadoop Work Together

Apache Hadoop is an open-source platform for distributed computing. Like HP Vertica, it harnesses the power of a computer cluster to process data. Hadoop forms an ecosystem that comprises many different components, each of which provides a different data processing capability. The Hadoop components that HP Vertica integrates with are:

- MapReduce: A programming framework for parallel data processing.
- The Hadoop Distributed Filesystem (HDFS): A fault-tolerant data storage system that takes network structure into account when storing data.
- Hive: A data warehouse that provides the ability to query data stored in Hadoop.
- HCatalog: A component that makes Hive metadata available to applications outside of Hadoop.

Some HDFS configurations use Kerberos authentication. HP Vertica integrates with Kerberos to access HDFS data if needed. See Using Kerberos with Hadoop.

While HP Vertica and Apache Hadoop can work together, HP Vertica contains many of the same data processing features as Hadoop. For example, using HP Vertica's flex tables, you can manipulate semi-structured data, which is a task commonly associated with Hadoop. You can also create user-defined extensions (UDxs) using HP Vertica's SDK that perform tasks similar to Hadoop's MapReduce jobs.

HP Vertica's Integrations for Apache Hadoop

Hewlett-Packard supplies several tools for integrating HP Vertica and Hadoop:

- The HP Vertica Connector for Apache Hadoop MapReduce lets you create Hadoop MapReduce jobs that retrieve data from HP Vertica. These jobs can also insert data into HP Vertica.
- The HP Vertica Connector for HDFS lets HP Vertica access data stored in HDFS.
- The HP Vertica Connector for HCatalog lets HP Vertica query data stored in a Hive database the same way you query data stored natively in an HP Vertica schema.
- The HP Vertica Storage Location for HDFS lets HP Vertica store data on HDFS. If you are using the HP Vertica for SQL on Hadoop product, this is how all of your data is stored.

Cluster Layout

In the Enterprise Edition product, your HP Vertica and Hadoop clusters must be set up on separate nodes, ideally connected by a high-bandwidth network connection. This is different from the configuration for HP Vertica for SQL on Hadoop, in which HP Vertica nodes are co-located on Hadoop nodes. The following figure illustrates the Enterprise Edition configuration.

The network is a key performance component of any well-configured cluster. When HP Vertica stores data to HDFS, it writes and reads data across the network.

The layout shown in the figure calls for two networks, and there are benefits to adding a third:

- Database Private Network: HP Vertica uses a private network for command and control and moving data between nodes in support of its database functions. In some networks, the command and control and passing of data are split across two networks.
- Database/Hadoop Shared Network: Each HP Vertica node must be able to connect to each Hadoop data node and the Name Node. Hadoop best practices generally require a dedicated network for the Hadoop cluster. This is not a technical requirement, but a dedicated network improves Hadoop performance. HP Vertica and Hadoop should share the dedicated Hadoop network.
- Optional Client Network: Outside clients may access the clustered networks through a client network. This is not an absolute requirement, but the use of a third network that supports client connections to either HP Vertica or Hadoop can improve performance. If the configuration does not support a client network, then client connections should use the shared network.

Choosing Which Hadoop Interface to Use

HP Vertica provides several ways to interact with data stored in Hadoop. The following summary shows when you should consider using each interface.

Use the HP Vertica Connector for Apache Hadoop MapReduce if you want to:
- Create a MapReduce job that can read data from or write data to HP Vertica.
- Create a streaming MapReduce job that reads data from or writes data to HP Vertica.
- Create a Pig script that can read from or store data into HP Vertica.

Use the HP Vertica Connector for HDFS if you want to:
- Load structured data from the Hadoop Distributed Filesystem (HDFS) into HP Vertica.
- Create an external table based on structured data stored in HDFS.

Use the HP Vertica Connector for HCatalog if you want to:
- Query or explore data stored in Apache Hive without copying it into your database.
- Query data stored in Parquet format.
- Integrate HP Vertica and Hive metadata in a single query.

Use Reading ORC Files Directly if you want to:
- Query data stored in Optimized Row Columnar (ORC) files in HDFS, if you are using the HP Vertica for SQL on Hadoop product (faster than the HCatalog Connector, but does not benefit from Hive data).

Use the HP Vertica Storage Location for HDFS if you want to:
- Store lower-priority data in HDFS (while your higher-priority data is stored locally), if you are using the Enterprise Edition product.
- Store data in the database native format (ROS files) for improved query execution (compared to Hive), if you are using the HP Vertica for SQL on Hadoop product.

If you are using the HP Vertica for SQL on Hadoop product, which is licensed for HDFS data only, there are some additional considerations. See Choosing How to Connect HP Vertica To HDFS.

Using the HP Vertica Connector for Hadoop MapReduce

The HP Vertica Connector for Hadoop MapReduce lets you create Hadoop MapReduce jobs that can read data from and write data to HP Vertica. You commonly use it when:

- You need to incorporate data from HP Vertica into your MapReduce job. For example, suppose you are using Hadoop's MapReduce to process web server logs. You may want to access sentiment analysis data stored in HP Vertica using Pulse to try to correlate a website visitor with social media activity.
- You are using Hadoop MapReduce to refine data on which you want to perform analytics. You can have your MapReduce job directly insert data into HP Vertica, where you can analyze it in real time using all of HP Vertica's features.

HP Vertica Connector for Hadoop Features

The HP Vertica Connector for Hadoop Map Reduce:

- gives Hadoop access to data stored in HP Vertica.
- lets Hadoop store its results in HP Vertica. The Connector can create a table for the Hadoop data if it does not already exist.
- lets applications written in Apache Pig access and store data in HP Vertica.
- works with Hadoop streaming.

The Connector runs on each node in the Hadoop cluster, so the Hadoop nodes and HP Vertica nodes communicate with each other directly. Direct connections allow data to be transferred in parallel, dramatically increasing processing speed. The Connector is written in Java, and is compatible with all platforms supported by Hadoop.

Note: To prevent Hadoop from potentially inserting multiple copies of data into HP Vertica, the HP Vertica Connector for Hadoop Map Reduce disables Hadoop's speculative execution feature.

Prerequisites

Before you can use the HP Vertica Connector for Hadoop MapReduce, you must install and configure Hadoop and be familiar with developing Hadoop applications. For details on installing and using Hadoop, please see the Apache Hadoop Web site.

See HP Vertica 7.1.x Supported Platforms for a list of the versions of Hadoop and Pig that the connector supports.

Hadoop and HP Vertica Cluster Scaling

When using the connector for MapReduce, nodes in the Hadoop cluster connect directly to HP Vertica nodes when retrieving or storing data. These direct connections allow the two clusters to transfer large volumes of data in parallel. If the Hadoop cluster is larger than the HP Vertica cluster, this parallel data transfer can negatively impact the performance of the HP Vertica database.

To avoid performance impacts on your HP Vertica database, ensure that your Hadoop cluster cannot overwhelm your HP Vertica cluster. The exact sizing of each cluster depends on how fast your Hadoop cluster generates data requests and the load placed on the HP Vertica database by queries from other sources. A good rule of thumb to follow is for your Hadoop cluster to be no larger than your HP Vertica cluster.

Installing the Connector

Follow these steps to install the HP Vertica Connector for Hadoop Map Reduce.

If you have not already done so, download the HP Vertica Connector for Hadoop Map Reduce installation package from the myVertica portal. Be sure to download the package that is compatible with your version of Hadoop. You can find your Hadoop version by issuing the following command on a Hadoop node:

    # hadoop version

You will also need a copy of the HP Vertica JDBC driver, which you can also download from the myVertica portal.

You need to perform the following steps on each node in your Hadoop cluster:

1. Copy the HP Vertica Connector for Hadoop Map Reduce .zip archive you downloaded to a temporary location on the Hadoop node.
2. Copy the HP Vertica JDBC driver .jar file to the same location on your node. If you haven't already, you can download this driver from the myVertica portal.
3. Unzip the connector .zip archive into a temporary directory. On Linux, you usually use the command unzip.
4. Locate the Hadoop home directory (the directory where Hadoop is installed). The location of this directory depends on how you installed Hadoop (manual install versus a package supplied by your Linux distribution or Cloudera). If you do not know the location of this directory, you can try the following steps:
   - See if the HADOOP_HOME environment variable is set by issuing the command echo $HADOOP_HOME on the command line.

   - See if Hadoop is in your path by typing hadoop classpath on the command line. If it is, this command lists the paths of all the jar files used by Hadoop, which should tell you the location of the Hadoop home directory.
   - If you installed using a .deb or .rpm package, you can look in /usr/lib/hadoop, as this is often the location where these packages install Hadoop.
5. Copy the file hadoop-vertica.jar from the directory where you unzipped the connector archive to the lib subdirectory in the Hadoop home directory.
6. Copy the HP Vertica JDBC driver file (vertica-jdbc-x.x.x.jar) to the lib subdirectory in the Hadoop home directory ($HADOOP_HOME/lib).
7. Edit the $HADOOP_HOME/conf/hadoop-env.sh file, and find the lines:

       # Extra Java CLASSPATH elements. Optional.
       # export HADOOP_CLASSPATH=

   Uncomment the export line by removing the hash character (#) and add the absolute path of the JDBC driver file you copied in the previous step. For example:

       export HADOOP_CLASSPATH=$HADOOP_HOME/lib/vertica-jdbc-x.x.x.jar

   This environment variable ensures that Hadoop can find the HP Vertica JDBC driver.
8. Also in the $HADOOP_HOME/conf/hadoop-env.sh file, ensure that the JAVA_HOME environment variable is set to your Java installation.
9. If you want your application written in Pig to be able to access HP Vertica, you need to:
   a. Locate the Pig home directory. Often, this directory is in the same parent directory as the Hadoop home directory.
   b. Copy the file named pig-vertica.jar from the directory where you unpacked the connector .zip file to the lib subdirectory in the Pig home directory.
   c. Copy the HP Vertica JDBC driver file (vertica-jdbc-x.x.x.jar) to the lib subdirectory in the Pig home directory.

Accessing HP Vertica Data From Hadoop

You need to follow three steps to have Hadoop fetch data from HP Vertica:

- Set the Hadoop job's input format to be VerticaInputFormat.
- Give the VerticaInputFormat class a query to be used to extract data from HP Vertica.
- Create a Mapper class that accepts VerticaRecord objects as input.

The following sections explain each of these steps in greater detail.

Selecting VerticaInputFormat

The first step to reading HP Vertica data from within a Hadoop job is to set its input format. You usually set the input format within the run() method in your Hadoop application's class. To set up the input format, pass VerticaInputFormat.class to the job.setInputFormatClass method, as follows:

    public int run(String[] args) throws Exception {
        // Set up the configuration and job objects
        Configuration conf = getConf();
        Job job = new Job(conf);

        // ... (later in the code)

        // Set the input format to retrieve data from
        // Vertica.
        job.setInputFormatClass(VerticaInputFormat.class);

Setting the input to the VerticaInputFormat class means that the map method will get VerticaRecord objects as its input.

Setting the Query to Retrieve Data From HP Vertica

A Hadoop job that reads data from your HP Vertica database has to execute a query that selects its input data. You pass this query to your Hadoop application using the setInput method of the VerticaInputFormat class. The HP Vertica Connector for Hadoop Map Reduce sends this query to the Hadoop nodes, which then individually connect to HP Vertica nodes to run the query and get their input data.

A primary consideration for this query is how it segments the data being retrieved from HP Vertica. Since each node in the Hadoop cluster needs data to process, the query result needs to be segmented between the nodes.

There are three formats you can use for the query you want your Hadoop job to use when retrieving input data. Each format determines how the query's results are split across the Hadoop cluster. These formats are:

- A simple, self-contained query.
- A parameterized query along with explicit parameters.
- A parameterized query along with a second query that retrieves the parameter values for the first query from HP Vertica.

The following sections explain each of these methods in detail.

Using a Simple Query to Extract Data From HP Vertica

The simplest format for the query that Hadoop uses to extract data from HP Vertica is a self-contained hard-coded query. You pass this query in a String to the setInput method of the VerticaInputFormat class. You usually make this call in the run method of your Hadoop job class. For example, the following code retrieves the entire contents of the table named alltypes.

    // Sets the query to use to get the data from the Vertica database.
    // Simple query with no parameters
    VerticaInputFormat.setInput(job, "SELECT * FROM alltypes ORDER BY key;");

The query you supply must have an ORDER BY clause, since the HP Vertica Connector for Hadoop Map Reduce uses it to figure out how to segment the query results between the Hadoop nodes. When it gets a simple query, the connector calculates limits and offsets to be sent to each node in the Hadoop cluster, so they each retrieve a portion of the query results to process.

Having Hadoop use a simple query to retrieve data from HP Vertica is the least efficient method, since the connector needs to perform extra processing to determine how the data should be segmented across the Hadoop nodes.

Using a Parameterized Query and Parameter Lists

You can have Hadoop retrieve data from HP Vertica using a parameterized query, to which you supply a set of parameters. The parameters in the query are represented by a question mark (?). You pass the query and the parameters to the setInput method of the VerticaInputFormat class. You have two options for passing the parameters: using a discrete list, or using a Collection object.

Using a Discrete List of Values

To pass a discrete list of parameters for the query, you include them in the setInput method call in a comma-separated list of string values, as shown in the next example:

    // Simple query with supplied parameters
    VerticaInputFormat.setInput(job,
        "SELECT * FROM alltypes WHERE key = ?", "1001", "1002", "1003");

The HP Vertica Connector for Hadoop Map Reduce tries to evenly distribute the query and parameters among the nodes in the Hadoop cluster. If the number of parameters is not a multiple of the number of nodes in the cluster, some nodes will get more parameters to process than others. Once the connector divides up the parameters among the Hadoop nodes, each node connects to a host in the HP Vertica database and executes the query, substituting in the parameter values it received.

This format is useful when you have a discrete set of parameters that will not change over time. However, it is inflexible because any changes to the parameter list require you to recompile your Hadoop job. An added limitation is that the query can contain just a single parameter, because the setInput method only accepts a single parameter list. The more flexible way to use parameterized queries is to use a collection to contain the parameters.

Using a Collection Object

The more flexible method of supplying the parameters for the query is to store them into a Collection object, then include the object in the setInput method call. This method allows you to build the list of parameters at run time, rather than having them hard-coded. You can also use multiple parameters in the query, since you pass a collection of ArrayList objects to the setInput method. Each ArrayList object supplies one set of parameter values, and can contain values for each parameter in the query.

The following example demonstrates using a collection to pass the parameter values for a query containing two parameters. The collection object passed to setInput is an instance of the HashSet class. This object contains four ArrayList objects added within the for loop. This example just adds dummy values (the loop counter and the string "FOUR"). In your own application, you usually calculate parameter values in some manner before adding them to the collection.

Note: If your parameter values are stored in HP Vertica, you should specify the parameters using a query instead of a collection. See Using a Query to Retrieve Parameter Values for a Parameterized Query for details.

    // Collection to hold all of the sets of parameters for the query.
    Collection<List<Object>> params = new HashSet<List<Object>>() {};

    // Each set of parameters lives in an ArrayList. Each entry
    // in the list supplies a value for a single parameter in
    // the query. Here, ArrayList objects are created in a loop
    // that adds the loop counter and a static string as the
    // parameters. The ArrayList is then added to the collection.
    for (int i = 0; i < 4; i++) {
        ArrayList<Object> param = new ArrayList<Object>();
        param.add(i);
        param.add("FOUR");
        params.add(param);
    }

    VerticaInputFormat.setInput(job,
        "select * from alltypes where key = ? AND NOT varcharcol = ?", params);

Scaling Parameter Lists for the Hadoop Cluster

Whenever possible, make the number of parameter values you pass to the HP Vertica Connector for Hadoop Map Reduce equal to the number of nodes in the Hadoop cluster, because each parameter value is assigned to a single Hadoop node. This ensures that the workload is spread across the entire Hadoop cluster. If you supply fewer parameter values than there are nodes in the Hadoop cluster, some of the nodes will not get a value and will sit idle.

If the number of parameter values is not a multiple of the number of nodes in the cluster, Hadoop randomly assigns the extra values to nodes in the cluster. It does not perform scheduling: it does not wait for a node to finish its task and become free before assigning additional tasks. In this case, a node could become a bottleneck if it is assigned the longer-running portions of the job.

In addition to the number of parameter values, you should make the parameter values yield roughly the same number of results. Ensuring that each parameter yields the same number of results helps prevent a single node in the Hadoop cluster from becoming a bottleneck by having to process more data than the other nodes in the cluster.

Using a Query to Retrieve Parameter Values for a Parameterized Query

You can pass the HP Vertica Connector for Hadoop Map Reduce a query to extract the parameter values for a parameterized query. This query must return a single column of data that is used as parameters for the parameterized query.

To use a query to retrieve the parameter values, supply the VerticaInputFormat class's setInput method with the parameterized query and a query to retrieve parameters. For example:

    // Sets the query to use to get the data from the Vertica database.
    // Query using a parameter that is supplied by another query
    VerticaInputFormat.setInput(job,
        "select * from alltypes where key = ?",
        "select distinct key from regions");

When it receives a query for parameters, the connector runs the query itself, then groups the results together to send out to the Hadoop nodes, along with the parameterized query. The Hadoop nodes then run the parameterized query using the set of parameter values sent to them by the connector.

Writing a Map Class That Processes HP Vertica Data

Once you have set up your Hadoop application to read data from HP Vertica, you need to create a Map class that actually processes the data. Your Map class's map method receives LongWritable values as keys and VerticaRecord objects as values. The key values are just sequential numbers that identify the row in the query results. The VerticaRecord class represents a single row from the result set returned by the query you supplied to the VerticaInputFormat.setInput method.

Working with the VerticaRecord Class

Your map method extracts the data it needs from the VerticaRecord class. This class contains three main methods you use to extract data from the record set:

- get retrieves a single value, either by index value or by name, from the row sent to the map method.
- getOrdinalPosition takes a string containing a column name and returns the column's number.
- getType returns the data type of a column in the row specified by index value or by name. This method is useful if you are unsure of the data types of the columns returned by the query. The types are stored as integer values defined by the java.sql.Types class.

A short sketch at the end of this section shows getOrdinalPosition and getType in use.

The following example shows a Mapper class and map method that accepts VerticaRecord objects. In this example, no real work is done. Instead, two values are selected as the key and value to be passed on to the reducer.

    public static class Map extends
            Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> {
        // This mapper accepts VerticaRecords as input.
        public void map(LongWritable key, VerticaRecord value, Context context)
                throws IOException, InterruptedException {
            // In your mapper, you would do actual processing here.
            // This simple example just extracts two values from the row of
            // data and passes them to the reducer as the key and value.
            if (value.get(3) != null && value.get(0) != null) {
                context.write(new Text((String) value.get(3)),
                        new DoubleWritable((Long) value.get(0)));
            }
        }
    }

If your Hadoop job has a reduce stage, all of the map method output is managed by Hadoop. It is not stored or manipulated in any way by HP Vertica. If your Hadoop job does not have a reduce stage, and needs to store its output into HP Vertica, your map method must output its keys as Text objects and values as VerticaRecord objects.
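If you do not know the layout of the query results in advance, you can combine these methods. The following sketch is an illustration based on the method descriptions above, not code taken from the connector's own examples. It assumes the same imports as the full example application later in this guide, that getOrdinalPosition returns an int column number, and that getType returns a java.sql.Types constant; the column name "total" is a hypothetical column in your query results.

    public static class InspectingMap extends
            Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> {
        public void map(LongWritable key, VerticaRecord value, Context context)
                throws IOException, InterruptedException {
            // Look up a column's position by name. "total" is a hypothetical
            // column; substitute a column returned by your own query.
            int col = value.getOrdinalPosition("total");

            // Check the column's reported type (a java.sql.Types constant)
            // before casting, and skip rows where the column is NULL.
            if (value.getType(col) == java.sql.Types.BIGINT
                    && value.get(col) != null) {
                context.write(new Text("total"),
                        new DoubleWritable((Long) value.get(col)));
            }
        }
    }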

Writing Data to HP Vertica From Hadoop

There are three steps you need to take for your Hadoop application to store data in HP Vertica:

- Set the output value class of your Hadoop job to VerticaRecord.
- Set the details of the HP Vertica table where you want to store your data in the VerticaOutputFormat class.
- Create a Reduce class that adds data to a VerticaRecord object and calls its write method to store the data.

The following sections explain these steps in more detail.

Configuring Hadoop to Output to HP Vertica

To tell your Hadoop application to output data to HP Vertica, you configure your Hadoop application to output to the HP Vertica Connector for Hadoop Map Reduce. You will normally perform these steps in your Hadoop application's run method. There are three methods that need to be called in order to set up the output to be sent to the connector and to set the output of the Reduce class, as shown in the following example:

    // Set the output format of the Reduce class. It will
    // output VerticaRecords that will be stored in the
    // database.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(VerticaRecord.class);

    // Tell Hadoop to send its output to the
    // HP Vertica Connector for Hadoop Map Reduce.
    job.setOutputFormatClass(VerticaOutputFormat.class);

The call to setOutputValueClass tells Hadoop that the output of the Reduce.reduce method is a VerticaRecord class object. This object represents a single row of an HP Vertica database table. You tell your Hadoop job to send the data to the connector by setting the output format class to VerticaOutputFormat.

Defining the Output Table

Call the VerticaOutputFormat.setOutput method to define the table that will hold the Hadoop application output:

    VerticaOutputFormat.setOutput(jobObject, tableName,
        [truncate, ["columnName1 dataType1" [, "columnNameN dataTypeN" ...]]]);

jobObject
    The Hadoop job object for your application.
tableName
    The name of the table to store Hadoop's output. If this table does not exist, the HP Vertica Connector for Hadoop Map Reduce automatically creates it. The name can be a full database.schema.table reference.
truncate
    A Boolean controlling whether to delete the contents of tableName if it already exists. If set to true, any existing data in the table is deleted before Hadoop's output is stored. If set to false or not given, the Hadoop output is added to the existing data in the table.
"columnName1 dataType1"
    The table column definitions, where columnName1 is the column name and dataType1 the SQL data type. These two values are separated by a space. If not specified, the existing table is used.

The first two parameters are required. You can add as many column definitions as you need in your output table.

You usually call the setOutput method in your Hadoop class's run method, where all other setup work is done. The following example sets up an output table named mrtarget that contains 8 columns, each containing a different data type:

    // Sets the output format for storing data in Vertica. It defines the
    // table where data is stored, and the columns it will be stored in.
    VerticaOutputFormat.setOutput(job, "mrtarget", true, "a int",
        "b boolean", "c char(1)", "d date", "f float",
        "t timestamp", "v varchar", "z varbinary");

If the truncate parameter is set to true for the method call, and the target table already exists in the HP Vertica database, the connector deletes the table contents before storing the Hadoop job's output.

Note: If the table already exists in the database, and the method call's truncate parameter is set to false, the HP Vertica Connector for Hadoop Map Reduce adds new application output to the existing table. However, the connector does not verify that the column definitions in the existing table match those defined in the setOutput method call. If the new application output values cannot be converted to the existing column values, your Hadoop application can throw casting exceptions.

Writing the Reduce Class

Once your Hadoop application is configured to output data to HP Vertica and has its output table defined, you need to create the Reduce class that actually formats and writes the data for storage in HP Vertica.

The first step your Reduce class should take is to instantiate a VerticaRecord object to hold the output of the reduce method. This is a little more complex than just instantiating a base object, since the VerticaRecord object must have the columns defined in it that match the output table's columns (see Defining the Output Table for details). To get the properly configured VerticaRecord object, you pass the constructor the configuration object.

You usually instantiate the VerticaRecord object in your Reduce class's setup method, which Hadoop calls before it calls the reduce method. For example:

    // Sets up the output record that will be populated by
    // the reducer and eventually written out.
    public void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        try {
            // Instantiate a VerticaRecord object that has the proper
            // column definitions. The object is stored in the record
            // field for use later.
            record = new VerticaRecord(context.getConfiguration());
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

Storing Data in the VerticaRecord

Your reduce method starts the same way any other Hadoop reduce method does: it processes its input key and value, performing whatever reduction task your application needs. Afterwards, your reduce method adds the data to be stored in HP Vertica to the VerticaRecord object that was instantiated earlier. Usually you use the set method to add the data:

    VerticaRecord.set(column, value);

column
    The column to store the value in. This is either an integer (the column number) or a String (the column name, as defined in the table definition).
    Note: The set method throws an exception if you pass it the name of a column that does not exist. You should always use a try/catch block around any set method call that uses a column name.
value
    The value to store in the column. The data type of this value must match the definition of the column in the table.
    Note: If you do not have the set method validate that the data types of the value and the column match, the HP Vertica Connector for Hadoop Map Reduce throws a ClassCastException if it finds a mismatch when it tries to commit the data to the database. This exception causes a rollback of the entire result. By having the set method validate the data type of the value, you can catch and resolve the exception before it causes a rollback.

In addition to the set method, you can also use the setFromString method to have the HP Vertica Connector for Hadoop Map Reduce convert the value from String to the proper data type for the column:

    VerticaRecord.setFromString(column, "value");

column
    The column number to store the value in, as an integer.

value
    A String containing the value to store in the column. If the String cannot be converted to the correct data type to be stored in the column, setFromString throws an exception (ParseException for date values, NumberFormatException for numeric values).

Your reduce method must output a value for every column in the output table. If you want a column to have a null value, you must explicitly set it.

After it populates the VerticaRecord object, your reduce method calls the Context.write method, passing it the name of the table to store the data in as the key, and the VerticaRecord object as the value.

The following example shows a simple Reduce class that stores data into HP Vertica. To make the example as simple as possible, the code doesn't actually process the input it receives, and instead just writes dummy data to the database. In your own application, you would process the key and values into data that you then store in the VerticaRecord object.

    public static class Reduce extends
            Reducer<Text, DoubleWritable, Text, VerticaRecord> {
        // Holds the records that the reducer writes its values to.
        VerticaRecord record = null;

        // Sets up the output record that will be populated by
        // the reducer and eventually written out.
        public void setup(Context context)
                throws IOException, InterruptedException {
            super.setup(context);
            try {
                // Need to call VerticaOutputFormat to get a record object
                // that has the proper column definitions.
                record = new VerticaRecord(context.getConfiguration());
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        // The reduce method.
        public void reduce(Text key, Iterable<DoubleWritable> values,
                Context context) throws IOException, InterruptedException {
            // Ensure that the record object has been set up properly. This is
            // where the results are written.
            if (record == null) {
                throw new IOException("No output record found");
            }

            // In this part of your application, your reducer would process the
            // key and values parameters to arrive at values that you want to
            // store into the database. For simplicity's sake, this example
            // skips all of the processing, and just inserts arbitrary values
            // into the database.
            //
            // Use the .set method to store a value in the record to be stored
            // in the database. The first parameter is the column number,
            // the second is the value to store.
            //
            // Column numbers start at 0.
            //
            // Set column 0 to an integer value. You
            // should always use a try/catch block to catch the exception.
            try {
                record.set(0, 125);
            } catch (Exception e) {
                // Handle the improper data type here.
                e.printStackTrace();
            }

            // You can also set column values by name rather than by column
            // number. However, this requires a try/catch since specifying a
            // non-existent column name will throw an exception.
            try {
                // The second column, named "b", contains a Boolean value.
                record.set("b", true);
            } catch (Exception e) {
                // Handle an improper column name here.
                e.printStackTrace();
            }

            // Column 2 stores a single char value.
            record.set(2, 'c');

            // Column 3 is a date. Value must be a java.sql.Date type.
            record.set(3, new java.sql.Date(
                    Calendar.getInstance().getTimeInMillis()));

            // You can use the setFromString method to convert a string
            // value into the proper data type to be stored in the column.
            // You need to use a try...catch block in this case, since the
            // string to value conversion could fail (for example, trying to
            // store "Hello, World!" in a float column is not going to work).
            try {
                record.setFromString(4, "234.567");
            } catch (ParseException e) {
                // Thrown if the string cannot be parsed into the data type
                // to be stored in the column.
                e.printStackTrace();
            }

            // Column 5 stores a timestamp.
            record.set(5, new java.sql.Timestamp(
                    Calendar.getInstance().getTimeInMillis()));

            // Column 6 stores a varchar.
            record.set(6, "example string");

            // Column 7 stores a varbinary.
            record.set(7, new byte[10]);

            // Once the columns are populated, write the record to store
            // the row into the database.
            context.write(new Text("mrtarget"), record);
        }
    }

Passing Parameters to the HP Vertica Connector for Hadoop Map Reduce At Run Time

Specifying the Location of the Connector .jar File

Recent versions of Hadoop fail to find the HP Vertica Connector for Hadoop Map Reduce classes automatically, even though they are included in the Hadoop lib directory. Therefore, you need to manually tell Hadoop where to find the connector .jar file using the libjars argument:

    hadoop jar myHadoopApp.jar com.myorg.hadoop.MyHadoopApp \
        -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
        ...

Specifying the Database Connection Parameters

You need to pass connection parameters to the HP Vertica Connector for Hadoop Map Reduce when starting your Hadoop application, so it knows how to connect to your database. At a minimum, these parameters must include the list of hostnames in the HP Vertica database cluster, the name of the database, and the user name. The common parameters for accessing the database appear in the following list. Usually, you will only need these basic parameters in order to start your Hadoop application.

mapred.vertica.hostnames (required; no default)
    A comma-separated list of the names or IP addresses of the hosts in the HP Vertica cluster. You should list all of the nodes in the cluster here, since individual nodes in the Hadoop cluster connect directly with a randomly assigned host in the cluster. The hosts in this cluster are used for both reading from and writing data to the HP Vertica database, unless you specify a different output database (see below).
mapred.vertica.port (optional; default 5433)
    The port number for the HP Vertica database.
mapred.vertica.database (required)
    The name of the database the Hadoop application should access.
mapred.vertica.username (required)
    The username to use when connecting to the database.
mapred.vertica.password (optional; default empty)
    The password to use when connecting to the database.

You pass the parameters to the connector using the -D command line switch in the command you use to start your Hadoop application. For example:

    hadoop jar myHadoopApp.jar com.myorg.hadoop.MyHadoopApp \
        -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
        -Dmapred.vertica.hostnames=Vertica01,Vertica02,Vertica03,Vertica04 \
        -Dmapred.vertica.port=5433 -Dmapred.vertica.username=exampleuser \
        -Dmapred.vertica.password=password123 -Dmapred.vertica.database=ExampleDB

A short sketch at the end of this topic shows how these values can be read back from the job configuration.

Parameters for a Separate Output Database

The parameters in the previous list are all you need if your Hadoop application accesses a single HP Vertica database. You can also have your Hadoop application read from one HP Vertica database and write to a different HP Vertica database. In this case, the parameters shown in the previous list apply to the input database (the one Hadoop reads data from). The following list describes the parameters that you use to supply your Hadoop application with the connection information for the output database (the one it writes its data to). None of these parameters is required. If you do not assign a value to one of these output parameters, it inherits its value from the input database parameters.

mapred.vertica.hostnames.output (default: input hostnames)
    A comma-separated list of the names or IP addresses of the hosts in the output HP Vertica cluster.
mapred.vertica.port.output (default: 5433)
    The port number for the output HP Vertica database.
mapred.vertica.database.output (default: input database name)
    The name of the output database.
mapred.vertica.username.output (default: input database user name)
    The username to use when connecting to the output database.
mapred.vertica.password.output (default: input database password)
    The password to use when connecting to the output database.
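Because the example application extends Configured and is launched through ToolRunner, any values supplied with -D are available to the job through its Configuration object. The following sketch is an illustration, not part of the connector's example code: it shows how a run method might read a few of the documented property names back from the configuration, for example to log them or to fail early when a required parameter is missing. The validation logic and messages are assumptions made for this sketch.

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        // Read back the connector parameters supplied with -D.
        String hosts = conf.get("mapred.vertica.hostnames");
        String database = conf.get("mapred.vertica.database");
        String user = conf.get("mapred.vertica.username");
        int port = conf.getInt("mapred.vertica.port", 5433);

        // Fail early if a required parameter is missing (hypothetical check).
        if (hosts == null || database == null || user == null) {
            System.err.println("Missing one or more mapred.vertica.* connection parameters");
            return 1;
        }
        System.out.println("Connecting to " + database + " on " + hosts + ":" + port);

        // ... continue with the job setup shown in the example application ...
        return 0;
    }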

Example HP Vertica Connector for Hadoop Map Reduce Application

This section presents an example of using the HP Vertica Connector for Hadoop Map Reduce to retrieve data from and store data in an HP Vertica database. The example pulls together the code that has appeared in the previous topics to present a functioning example.

This application reads data from a table named alltypes. The mapper selects two values from this table to send to the reducer. The reducer doesn't perform any operations on the input, and instead inserts arbitrary data into the output table named mrtarget.

    package com.vertica.hadoop;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Calendar;
    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.List;
    import java.math.BigDecimal;
    import java.sql.Date;
    import java.sql.Timestamp;

    // Needed when using the setFromString method, which throws this exception.
    import java.text.ParseException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    import com.vertica.hadoop.VerticaConfiguration;
    import com.vertica.hadoop.VerticaInputFormat;
    import com.vertica.hadoop.VerticaOutputFormat;
    import com.vertica.hadoop.VerticaRecord;

    // This is the class that contains the entire Hadoop example.
    public class VerticaExample extends Configured implements Tool {
        public static class Map extends
                Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> {
            // This mapper accepts VerticaRecords as input.
            public void map(LongWritable key, VerticaRecord value, Context context)
                    throws IOException, InterruptedException {
                // In your mapper, you would do actual processing here.
                // This simple example just extracts two values from the row of
                // data and passes them to the reducer as the key and value.

                if (value.get(3) != null && value.get(0) != null) {
                    context.write(new Text((String) value.get(3)),
                            new DoubleWritable((Long) value.get(0)));
                }
            }
        }

        public static class Reduce extends
                Reducer<Text, DoubleWritable, Text, VerticaRecord> {
            // Holds the records that the reducer writes its values to.
            VerticaRecord record = null;

            // Sets up the output record that will be populated by
            // the reducer and eventually written out.
            public void setup(Context context)
                    throws IOException, InterruptedException {
                super.setup(context);
                try {
                    // Need to call VerticaOutputFormat to get a record object
                    // that has the proper column definitions.
                    record = new VerticaRecord(context.getConfiguration());
                } catch (Exception e) {
                    throw new IOException(e);
                }
            }

            // The reduce method.
            public void reduce(Text key, Iterable<DoubleWritable> values,
                    Context context) throws IOException, InterruptedException {
                // Ensure that the record object has been set up properly. This is
                // where the results are written.
                if (record == null) {
                    throw new IOException("No output record found");
                }

                // In this part of your application, your reducer would process the
                // key and values parameters to arrive at values that you want to
                // store into the database. For simplicity's sake, this example
                // skips all of the processing, and just inserts arbitrary values
                // into the database.
                //
                // Use the .set method to store a value in the record to be stored
                // in the database. The first parameter is the column number,
                // the second is the value to store.
                //
                // Column numbers start at 0.
                //
                // Set column 0 to an integer value. You
                // should always use a try/catch block to catch the exception.

            try {
                record.set(0, 125);
            } catch (Exception e) {
                // Handle the improper data type here.
                e.printStackTrace();
            }

            // You can also set column values by name rather than by column
            // number. However, this requires a try/catch since specifying a
            // non-existent column name will throw an exception.
            try {
                // The second column, named "b", contains a Boolean value.
                record.set("b", true);
            } catch (Exception e) {
                // Handle an improper column name here.
                e.printStackTrace();
            }

            // Column 2 stores a single char value.
            record.set(2, 'c');

            // Column 3 is a date. Value must be a java.sql.Date type.
            record.set(3, new java.sql.Date(
                    Calendar.getInstance().getTimeInMillis()));

            // You can use the setFromString method to convert a string
            // value into the proper data type to be stored in the column.
            // You need to use a try...catch block in this case, since the
            // string to value conversion could fail (for example, trying to
            // store "Hello, World!" in a float column is not going to work).
            try {
                record.setFromString(4, "234.567");
            } catch (ParseException e) {
                // Thrown if the string cannot be parsed into the data type
                // to be stored in the column.
                e.printStackTrace();
            }

            // Column 5 stores a timestamp.
            record.set(5, new java.sql.Timestamp(
                    Calendar.getInstance().getTimeInMillis()));

            // Column 6 stores a varchar.
            record.set(6, "example string");

            // Column 7 stores a varbinary.
            record.set(7, new byte[10]);

            // Once the columns are populated, write the record to store
            // the row into the database.
            context.write(new Text("mrtarget"), record);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Set up the configuration and job objects.

        Configuration conf = getConf();
        Job job = new Job(conf);
        conf = job.getConfiguration();
        conf.set("mapreduce.job.tracker", "local");
        job.setJobName("vertica test");

        // Set the input format to retrieve data from
        // Vertica.
        job.setInputFormatClass(VerticaInputFormat.class);

        // Set the output format of the mapper. This is the interim
        // data format passed to the reducer. Here, we will pass in a
        // Double. The interim data is not processed by Vertica in any
        // way.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);

        // Set the output format of the Hadoop application. It will
        // output VerticaRecords that will be stored in the
        // database.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(VerticaRecord.class);
        job.setOutputFormatClass(VerticaOutputFormat.class);
        job.setJarByClass(VerticaExample.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // Sets the output format for storing data in Vertica. It defines the
        // table where data is stored, and the columns it will be stored in.
        VerticaOutputFormat.setOutput(job, "mrtarget", true, "a int",
                "b boolean", "c char(1)", "d date", "f float",
                "t timestamp", "v varchar", "z varbinary");

        // Sets the query to use to get the data from the Vertica database.
        // Query using a list of parameters.
        VerticaInputFormat.setInput(job,
                "select * from alltypes where key = ?", "1", "2", "3");

        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new VerticaExample(),
                args);
        System.exit(res);
    }
}

Compiling and Running the Example Application

To run the example Hadoop application, you first need to set up the alltypes table that the example reads as input. To set up the input table, save the following Perl script as MakeAllTypes.pl to a location on one of your HP Vertica nodes:

#!/usr/bin/perl
open FILE, ">datasource" or die $!;

for ($i=0; $i < 10; $i++) {
    print FILE $i . "|" . rand(10000);
    print FILE "|one|ONE|1|1999-01-08|1999-02-23 03:11:52.35";
    print FILE '|1999-01-08 07:04:37|07:09:23|15:12:34 EST|0xabcd|';
    print FILE '0xabcd|1234532|03:03:03' . qq(\n);
}
close FILE;

Then follow these steps:

1. Connect to the node where you saved the MakeAllTypes.pl file.

2. Run the MakeAllTypes.pl file. This generates a file named datasource in the current directory.

   Note: If your node does not have Perl installed, you can run this script on a system that does have Perl installed, then copy the datasource file to a database node.

3. On the same node, use vsql to connect to your HP Vertica database.

4. Run the following query to create the alltypes table:

   CREATE TABLE alltypes (key identity, intcol integer, floatcol float,
       charcol char(10), varcharcol varchar, boolcol boolean,
       datecol date, timestampcol timestamp, timestamptzcol timestamptz,
       timecol time, timetzcol timetz, varbincol varbinary, bincol binary,
       numcol numeric(38,0), intervalcol interval);

5. Run the following query to load data from the datasource file into the table:

   COPY alltypes COLUMN OPTION (varbincol FORMAT 'hex', bincol FORMAT 'hex')
       FROM '/path-to-datasource/datasource' DIRECT;

   Replace path-to-datasource with the absolute path to the datasource file, which is located in the same directory where you ran MakeAllTypes.pl.
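Before moving on, you can optionally verify the load from vsql. The check below is only a sketch; the count shown assumes that all ten rows written by MakeAllTypes.pl loaded without rejection:

=> SELECT COUNT(*) FROM alltypes;
 COUNT
-------
    10
(1 row)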

Compiling the Example (optional)

The example code presented in this section is based on example code distributed along with the HP Vertica Connector for Hadoop Map Reduce in the file hadoop-vertica-example.jar. If you just want to run the example, skip to the next section and use the hadoop-vertica-example.jar file that came as part of the connector package rather than a version you compiled yourself.

To compile the example code listed in Example HP Vertica Connector for Hadoop Map Reduce Application, follow these steps:

1. Log into a node on your Hadoop cluster.

2. Locate the Hadoop home directory. See Installing the Connector for tips on how to find this directory.

3. If it is not already set, set the environment variable HADOOP_HOME to the Hadoop home directory:

   export HADOOP_HOME=path_to_Hadoop_home

   If you installed Hadoop using an .rpm or .deb package, Hadoop is usually installed in /usr/lib/hadoop:

   export HADOOP_HOME=/usr/lib/hadoop

4. Save the example source code to a file named VerticaExample.java on your Hadoop node.

5. In the same directory where you saved VerticaExample.java, create a directory named classes. On Linux, the command is:

   mkdir classes

6. Compile the Hadoop example:

   javac -classpath \
   $HADOOP_HOME/hadoop-core.jar:$HADOOP_HOME/lib/hadoop-vertica.jar \
   -d classes VerticaExample.java \
   && jar -cvf hadoop-vertica-example.jar -C classes .

   Note: If you receive errors about missing Hadoop classes, check the name of the hadoop-core.jar file. Most Hadoop installers (including Cloudera's) create a symbolic link named hadoop-core.jar to a version-specific .jar file (such as hadoop-core-0.20.203.0.jar). If your Hadoop installation did not create this link, you must supply the .jar file name with the version number.

When the compilation finishes, you will have a file named hadoop-vertica-example.jar in the same directory as the VerticaExample.java file. This is the file you will have Hadoop run.

Running the Example Application

Once you have compiled the example, run it using the following command line:

hadoop jar hadoop-vertica-example.jar \
    com.vertica.hadoop.VerticaExample \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,... \
    -Dmapred.vertica.port=portNumber \
    -Dmapred.vertica.username=userName \
    -Dmapred.vertica.password=dbPassword \
    -Dmapred.vertica.database=databaseName

This command tells Hadoop to run your application's .jar file, and supplies the parameters needed for your application to connect to your HP Vertica database. Fill in your own values for the hostnames, port, user name, password, and database name for your HP Vertica database.

After entering the command line, you will see output from Hadoop as it processes data that looks similar to the following:

12/01/11 10:41:19 INFO mapred.JobClient: Running job: job_201201101146_0005
12/01/11 10:41:20 INFO mapred.JobClient:  map 0% reduce 0%
12/01/11 10:41:36 INFO mapred.JobClient:  map 33% reduce 0%
12/01/11 10:41:39 INFO mapred.JobClient:  map 66% reduce 0%
12/01/11 10:41:42 INFO mapred.JobClient:  map 100% reduce 0%
12/01/11 10:41:45 INFO mapred.JobClient:  map 100% reduce 22%
12/01/11 10:41:51 INFO mapred.JobClient:  map 100% reduce 100%
12/01/11 10:41:56 INFO mapred.JobClient: Job complete: job_201201101146_0005
12/01/11 10:41:56 INFO mapred.JobClient: Counters: 23
12/01/11 10:41:56 INFO mapred.JobClient:   Job Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Launched reduce tasks=1
12/01/11 10:41:56 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=21545
12/01/11 10:41:56 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/11 10:41:56 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/01/11 10:41:56 INFO mapred.JobClient:     Launched map tasks=3
12/01/11 10:41:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13851
12/01/11 10:41:56 INFO mapred.JobClient:   File Output Format Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Bytes Written=0
12/01/11 10:41:56 INFO mapred.JobClient:   FileSystemCounters
12/01/11 10:41:56 INFO mapred.JobClient:     FILE_BYTES_READ=69
12/01/11 10:41:56 INFO mapred.JobClient:     HDFS_BYTES_READ=318
12/01/11 10:41:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=89367
12/01/11 10:41:56 INFO mapred.JobClient:   File Input Format Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Bytes Read=0
12/01/11 10:41:56 INFO mapred.JobClient:   Map-Reduce Framework
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce input groups=1
12/01/11 10:41:56 INFO mapred.JobClient:     Map output materialized bytes=81
12/01/11 10:41:56 INFO mapred.JobClient:     Combine output records=0
12/01/11 10:41:56 INFO mapred.JobClient:     Map input records=3
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce shuffle bytes=54
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce output records=1

12/01/11 10:41:56 INFO mapred.JobClient:     Spilled Records=6
12/01/11 10:41:56 INFO mapred.JobClient:     Map output bytes=57
12/01/11 10:41:56 INFO mapred.JobClient:     Combine input records=0
12/01/11 10:41:56 INFO mapred.JobClient:     Map output records=3
12/01/11 10:41:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=318
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce input records=3

Note: The version of the example supplied in the Hadoop Connector download package produces more output, because it runs several input queries.

Verifying the Results

Once your Hadoop application finishes, you can verify that it ran correctly by looking at the mrtarget table in your HP Vertica database. Connect to your HP Vertica database using vsql and run the following query:

=> SELECT * FROM mrtarget;

The results should look like this:

  a  | b | c |     d      |    f    |            t            |       v        |                    z
-----+---+---+------------+---------+-------------------------+----------------+------------------------------------------
 125 | t | c | 2012-01-11 | 234.567 | 2012-01-11 10:41:48.837 | example string | \000\000\000\000\000\000\000\000\000\000
(1 row)

Using Hadoop Streaming with the HP Vertica Connector for Hadoop Map Reduce

Hadoop streaming allows you to create an ad-hoc Hadoop job that uses standard commands (such as UNIX command-line utilities) for its map and reduce processing. When using streaming, Hadoop executes the command you pass to it as a mapper and breaks each line of its standard output into key and value pairs. By default, the key and value are separated by the first tab character in the line. These values are then passed via standard input to the command that you specified as the reducer. See the Hadoop wiki's topic on streaming for more information.

You can have a streaming job retrieve data from an HP Vertica database, store data into an HP Vertica database, or both.

Reading Data From HP Vertica in a Streaming Hadoop Job

To have a streaming Hadoop job read data from an HP Vertica database, you set the inputformat argument of the Hadoop command line to com.vertica.hadoop.deprecated.VerticaStreamingInput. You also need to supply parameters that tell the Hadoop job how to connect to your HP Vertica database. See Passing Parameters to the HP Vertica Connector for Hadoop Map Reduce At Run Time for an explanation of these command-line parameters.

Note: The VerticaStreamingInput class is within the deprecated namespace because the current version of Hadoop (as of 0.20.1) has not defined a current API for streaming. Instead, the streaming classes conform to the Hadoop version 0.18 API.

In addition to the standard command-line parameters that tell Hadoop how to access your database, there are additional streaming-specific parameters that supply Hadoop with the query it should use to extract data from HP Vertica and other query-related options:

mapred.vertica.input.query
    The query to use to retrieve data from the HP Vertica database. See Setting the Query to Retrieve Data from HP Vertica for more information.
    Required: Yes. Default: none.

mapred.vertica.input.paramquery
    A query to execute to retrieve parameters for the query given in the .input.query parameter.
    Required: If the query has a parameter and no discrete parameters are supplied. Default: none.

mapred.vertica.query.params
    A discrete list of parameters for the query.
    Required: If the query has a parameter and no parameter query is supplied. Default: none.

mapred.vertica.input.delimiter
    The character to use for separating column values. The command you use as a mapper needs to split individual column values apart using this delimiter.
    Required: No. Default: 0xa.

mapred.vertica.input.terminator
    The character used to signal the end of a row of data from the query result.
    Required: No. Default: 0xb.

The following command demonstrates reading data from a table named alltypes. This command uses the UNIX cat command as both mapper and reducer, so it just passes the contents through.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,... \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.input.query="SELECT key, intcol, floatcol, varcharcol FROM alltypes ORDER BY key" \
    -Dmapred.vertica.input.delimiter=, \
    -Dmapred.map.tasks=1 \
    -inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput \
    -input /tmp/input -output /tmp/output -reducer /bin/cat -mapper /bin/cat

The results of this command are saved in the /tmp/output directory on your HDFS filesystem. On a four-node Hadoop cluster, the results would be:

# $HADOOP_HOME/bin/hadoop dfs -ls /tmp/output
Found 5 items
drwxr-xr-x   - release supergroup          0 2012-01-19 11:47 /tmp/output/_logs
-rw-r--r--   3 release supergroup         88 2012-01-19 11:47 /tmp/output/part-00000
-rw-r--r--   3 release supergroup         58 2012-01-19 11:47 /tmp/output/part-00001
-rw-r--r--   3 release supergroup         58 2012-01-19 11:47 /tmp/output/part-00002
-rw-r--r--   3 release supergroup         87 2012-01-19 11:47 /tmp/output/part-00003
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00000
1       2,1,3165.75558015273,ONE,
5       6,5,1765.76024139635,ONE,
9       10,9,4142.54176256463,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00001
2       3,2,8257.77313710329,ONE,
6       7,6,7267.69718012601,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00002
3       4,3,443.188765520475,ONE,
7       8,7,4729.27825566408,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00003
0       1,0,2456.83076632307,ONE,
4       5,4,9692.61214265391,ONE,
8       9,8,3327.25019418294,ONE,
13      1015,15,15.1515,FIFTEEN
2       1003,3,333.0,THREE
3       1004,4,0.0,FOUR
4       1005,5,0.0,FIVE
5       1007,7,0.0,SEVEN
6       1008,8,1.0E36,EIGHT
7       1009,9,-1.0E36,NINE
8       1010,10,0.0,TEN
9       1011,11,11.11,ELEVEN

Notes

Even though the input is coming from HP Vertica, you need to supply the -input parameter to Hadoop for it to process the streaming job.

The -Dmapred.map.tasks=1 parameter prevents multiple Hadoop nodes from reading the same data from the database, which would result in Hadoop processing multiple copies of the data.

Writing Data to HP Vertica in a Streaming Hadoop Job

Similar to reading from a streaming Hadoop job, you write data to HP Vertica by setting the outputformat parameter of your Hadoop command to com.vertica.hadoop.deprecated.VerticaStreamingOutput. This class requires key/value pairs, but the keys are ignored. The values passed to VerticaStreamingOutput are broken into rows and inserted into a target table. Since keys are ignored, you can use the keys to partition the data for the reduce phase without affecting HP Vertica's data transfer.

As with reading from HP Vertica, you need to supply parameters that tell the streaming Hadoop job how to connect to the database. See Passing Parameters to the HP Vertica Connector for Hadoop Map Reduce At Run Time for an explanation of these command-line parameters. If you are reading data from one HP Vertica database and writing to another, you need to use the output parameters, just as you would if you were reading from and writing to separate databases in a Hadoop application.

There are also additional parameters that configure the output of the streaming Hadoop job:

mapred.vertica.output.table.name
    The name of the table where Hadoop should store its data.
    Required: Yes. Default: none.

mapred.vertica.output.table.def
    The definition of the table. The format is the same as used for defining the output table for a Hadoop application. See Defining the Output Table for details.
    Required: If the table does not already exist in the database. Default: none.

mapred.vertica.output.table.drop
    Whether to truncate the table before adding data to it.
    Required: No. Default: false.

mapred.vertica.output.delimiter
    The character to use for separating column values.
    Required: No. Default: 0x7 (ASCII bell character).

mapred.vertica.output.terminator
    The character used to signal the end of a row of data.
    Required: No. Default: 0x8 (ASCII backspace).

The following example demonstrates reading two columns of data from an HP Vertica database table named alltypes and writing it back to the same database in a table named hadoopout. The command provides the definition for the table, so you do not have to manually create the table beforehand.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \

    -Dmapred.vertica.output.table.name=hadoopout \
    -Dmapred.vertica.output.table.def="intcol integer, varcharcol varchar" \
    -Dmapred.vertica.output.table.drop=true \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.input.query="SELECT intcol, varcharcol FROM alltypes ORDER BY key" \
    -Dmapred.vertica.input.delimiter=, \
    -Dmapred.vertica.output.delimiter=, \
    -Dmapred.vertica.input.terminator=0x0a \
    -Dmapred.vertica.output.terminator=0x0a \
    -inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput \
    -outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput \
    -input /tmp/input \
    -output /tmp/output \
    -reducer /bin/cat \
    -mapper /bin/cat

After running this command, you can view the result by querying your database:

=> SELECT * FROM hadoopout;
 intcol | varcharcol
--------+------------
      1 | ONE
      5 | ONE
      9 | ONE
      2 | ONE
      6 | ONE
      0 | ONE
      4 | ONE
      8 | ONE
      3 | ONE
      7 | ONE
(10 rows)

Loading a Text File From HDFS into HP Vertica

One common task when working with Hadoop and HP Vertica is loading text files from the Hadoop Distributed File System (HDFS) into an HP Vertica table. You can load these files using Hadoop streaming, saving yourself the trouble of having to write custom map and reduce classes.

Note: Hadoop streaming is less efficient than a Java map/reduce Hadoop job, because it passes data through several different interfaces. Streaming is best used for smaller, one-time loads. If you need to load large amounts of data on a regular basis, you should create a standard Hadoop map/reduce job in Java or a script in Pig.

For example, suppose you have a text file in HDFS that you want to load. It contains values delimited by pipe characters (|), and each line of the file is terminated by a carriage return:

# $HADOOP_HOME/bin/hadoop dfs -cat /tmp/textdata.txt
1|1.0|ONE
2|2.0|TWO
3|3.0|THREE

In this case, the line delimiter poses a problem. You can easily include the column delimiter in the Hadoop command-line arguments. However, it is hard to specify a carriage return in the Hadoop command line. To get around this issue, you can write a mapper script to strip the carriage return and replace it with some other character that is easy to enter in the command line and that does not occur in the data.

Below is an example of a mapper script written in Python. It performs two tasks:

It strips the carriage returns from the input text and terminates each line with a tilde (~).

It adds a key value (the string "streaming") followed by a tab character at the start of each line of the text file. The mapper script needs to do this because the streaming job that reads text files skips the reducer stage. The reducer is not necessary, because all of the data being read from the text file should be stored in the HP Vertica table. However, the VerticaStreamingOutput class requires key and value pairs, so the mapper script adds the key.

#!/usr/bin/python
import sys
for line in sys.stdin.readlines():
    # Get rid of carriage returns.
    # CR is used as the record terminator by Streaming.jar
    line = line.strip();
    # Add a key. The key value can be anything.
    # The convention is to use the name of the
    # target table, as shown here.
    sys.stdout.write("streaming\t%s~\n" % line)

The Hadoop command to stream text files from the HDFS into HP Vertica using the above mapper script appears below.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.vertica.output.table.name=streaming \
    -Dmapred.vertica.output.table.def="intcol integer, floatcol float, varcharcol varchar" \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.output.delimiter="|" \
    -Dmapred.vertica.output.terminator="~" \
    -input /tmp/textdata.txt \
    -output output \
    -mapper "python path-to-script/mapper.py" \
    -outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput

Notes

The -Dmapred.reduce.tasks=0 parameter disables the streaming job's reducer stage. The job does not need a reducer because the mapper script processes the data into the format that the VerticaStreamingOutput class expects.

Even though the VerticaStreamingOutput class is handling the output from the mapper, you need to supply a valid output directory to the Hadoop command.

The result of running the command is a new table in the HP Vertica database:

=> SELECT * FROM streaming;
 intcol | floatcol | varcharcol
--------+----------+------------
      3 |        3 | THREE
      1 |        1 | ONE
      2 |        2 | TWO
(3 rows)

Accessing HP Vertica From Pig

The HP Vertica Connector for Hadoop Map Reduce includes a Java package that lets you access an HP Vertica database using Pig. You must copy this .jar file to somewhere in your Pig installation's CLASSPATH (see Installing the Connector for details).

Registering the HP Vertica .jar Files

Before it can access HP Vertica, your Pig Latin script must register the HP Vertica-related .jar files. All of your Pig scripts should start with the following commands:

REGISTER 'path-to-pig-home/lib/vertica-jdbc-7.1.x.jar';
REGISTER 'path-to-pig-home/lib/pig-vertica.jar';

These commands ensure that Pig can locate the HP Vertica JDBC classes, as well as the interface for the connector.

Reading Data From HP Vertica

To read data from an HP Vertica database, you tell Pig Latin's LOAD statement to use a SQL query and to use the VerticaLoader class as the load function. Your query can be hard coded, or it can contain a parameter. See Setting the Query to Retrieve Data from HP Vertica for details.

Note: You can only use a discrete parameter list or supply a query to retrieve parameter values; you cannot use a collection to supply parameter values as you can from within a Hadoop application.

The format for calling the VerticaLoader is:

com.vertica.pig.VerticaLoader('hosts','database','port','username','password');

hosts
    A comma-separated list of the hosts in the HP Vertica cluster.
database
    The name of the database to be queried.
port
    The port number for the database.
username
    The username to use when connecting to the database.
password
    The password to use when connecting to the database. This is the only optional parameter. If not present, the connector assumes the password is empty.

The following Pig Latin command extracts all of the data from the table named alltypes using a simple query:

A = LOAD 'sql://{SELECT * FROM alltypes ORDER BY key}' USING
    com.vertica.pig.VerticaLoader('vertica01,vertica02,vertica03',
    'ExampleDB','5433','ExampleUser','password123');

This example uses a parameter and supplies a discrete list of parameter values:

A = LOAD 'sql://{SELECT * FROM alltypes WHERE key = ?};{1,2,3}' USING
    com.vertica.pig.VerticaLoader('vertica01,vertica02,vertica03',
    'ExampleDB','5433','ExampleUser','password123');

This final example demonstrates using a second query to retrieve parameters from the HP Vertica database:

A = LOAD 'sql://{SELECT * FROM alltypes WHERE key = ?};sql://{SELECT DISTINCT key FROM alltypes}'
    USING com.vertica.pig.VerticaLoader('vertica01,vertica02,vertica03','ExampleDB',
    '5433','ExampleUser','password123');

Writing Data to HP Vertica

To write data to an HP Vertica database, you tell Pig Latin's STORE statement to save data to a database table (optionally giving the definition of the table) and to use the VerticaStorer class as the save function. If the table you specify as the destination does not exist, and you supplied the table definition, the table is automatically created in your HP Vertica database and the data from the relation is loaded into it.

The syntax for calling the VerticaStorer is the same as calling VerticaLoader:

com.vertica.pig.VerticaStorer('hosts','database','port','username','password');

The following example demonstrates saving a relation into a table named hadoopout, which must already exist in the database:

STORE A INTO '{hadoopout}' USING
    com.vertica.pig.VerticaStorer('vertica01,vertica02,vertica03','ExampleDB','5433',
    'ExampleUser','password123');

This example shows how you can add a table definition to the table name, so that the table is created in HP Vertica if it does not already exist:

STORE A INTO '{outtable(a int, b int, c float, d char(10), e varchar, f boolean, g date,
    h timestamp, i timestamptz, j time, k timetz, l varbinary, m binary,
    n numeric(38,0), o interval)}' USING
    com.vertica.pig.VerticaStorer('vertica01,vertica02,vertica03','ExampleDB','5433',
    'ExampleUser','password123');

Note: If the table already exists in the database, and the definition that you supply differs from the table's definition, the table is not dropped and recreated. This may cause data type errors when data is being loaded.
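To confirm that the STORE statement loaded the data, you can run a quick check from vsql. The query below is only a sketch, written for the outtable table created in the second example; the actual count depends entirely on the contents of relation A, so substitute your own table name and expected row count.

=> SELECT COUNT(*) FROM outtable;

If the count matches the number of tuples in the relation, the Pig job stored the data as expected.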

Using the HP Vertica Connector for HDFS

The Hadoop Distributed File System (HDFS) is the location where Hadoop usually stores its input and output files. It stores files across the Hadoop cluster redundantly, to keep the files available even if some nodes are down. HDFS also makes Hadoop more efficient by spreading file access tasks across the cluster to help limit I/O bottlenecks.

The HP Vertica Connector for HDFS lets you load files from HDFS into HP Vertica using the COPY statement. You can also create external tables that access data stored on HDFS as if it were a native HP Vertica table. The connector is useful if your Hadoop job does not directly store its data in HP Vertica using the HP Vertica Connector for Hadoop Map Reduce (see Using the HP Vertica Connector for Hadoop MapReduce), or if you use HDFS to store files and want to process them using HP Vertica.

Note: The files you load from HDFS using the HP Vertica Connector for HDFS usually have a delimited format. Column values are separated by a character, such as a comma or a pipe character (|). This format is the same type used in other files you load with the COPY statement. Hadoop MapReduce jobs often output tab-delimited files.

Like the HP Vertica Connector for Hadoop Map Reduce, the HP Vertica Connector for HDFS takes advantage of the distributed nature of both HP Vertica and Hadoop. Individual nodes in the HP Vertica cluster connect directly to nodes in the Hadoop cluster when you load multiple files from HDFS. Hadoop splits large files into file segments that it stores on different nodes. The connector retrieves these file segments directly from the node storing them, rather than relying on the Hadoop cluster to reassemble the file.

The connector is read-only; it cannot write data to HDFS.

The HP Vertica Connector for HDFS can connect to a Hadoop cluster through unauthenticated and Kerberos-authenticated connections.

HP Vertica Connector for HDFS Requirements

The HP Vertica Connector for HDFS connects to the Hadoop file system using WebHDFS, a built-in component of HDFS that provides access to HDFS files to applications outside of Hadoop. WebHDFS was added to Hadoop in version 1.0, so your Hadoop installation must be version 1.0 or later. In addition, the WebHDFS system must be enabled. See your Hadoop distribution's documentation for instructions on configuring and enabling WebHDFS.

Note: HTTPfs (also known as HOOP) is another method of accessing files stored in an HDFS. It relies on a separate server process that receives requests for files and retrieves them from the HDFS. Since it uses a REST API that is compatible with WebHDFS, it could theoretically work with the connector.

However, the connector has not been tested with HTTPfs, and HP does not support using the HP Vertica Connector for HDFS with HTTPfs. In addition, since all of the files retrieved from HDFS must pass through the HTTPfs server, it is less efficient than WebHDFS, which lets HP Vertica nodes directly connect to the Hadoop nodes storing the file blocks.

Kerberos Authentication Requirements

The HP Vertica Connector for HDFS can connect to the Hadoop file system using Kerberos authentication. To use Kerberos, your connector must meet these additional requirements:

Your HP Vertica installation must be Kerberos-enabled.

Your Hadoop cluster must be configured to use Kerberos authentication.

Your connector must be able to connect to the Kerberos-enabled Hadoop cluster.

The Kerberos server must be running version 5.

The Kerberos server must be accessible from every node in your HP Vertica cluster.

You must have Kerberos principals (users) that map to Hadoop users. You use these principals to authenticate your HP Vertica users with the Hadoop cluster.

Before you can use the HP Vertica Connector for HDFS with Kerberos, you must install the Kerberos client and libraries on your HP Vertica cluster.

Testing Your Hadoop WebHDFS Configuration

To ensure that your Hadoop installation's WebHDFS system is configured and running, follow these steps:

1. Log into your Hadoop cluster and locate a small text file on the Hadoop filesystem. If you do not have a suitable file, you can create a file named test.txt in the /tmp directory using the following command:

   echo -e "A|1|2|3\nB|4|5|6" | hadoop fs -put - /tmp/test.txt

2. Log into a host in your HP Vertica database using the database administrator account.

3. If you are using Kerberos authentication, authenticate with the Kerberos server using the keytab file for a user who is authorized to access the file. For example, to authenticate as a user named exampleuser@mycompany.com, use the command:

   $ kinit exampleuser@mycompany.com -k -t /path/exampleuser.keytab

   Where path is the path to the keytab file you copied over to the node.

   You do not receive any message if you authenticate successfully. You can verify that you are authenticated by using the klist command:

   $ klist
   Ticket cache: FILE:/tmp/krb5cc_500
   Default principal: exampleuser@mycompany.com

   Valid starting     Expires            Service principal
   07/24/13 14:30:19  07/25/13 14:30:19  krbtgt/mycompany.com@mycompany.com
           renew until 07/24/13 14:30:19

4. Test retrieving the file:

   If you are not using Kerberos authentication, run the following command from the Linux command line:

   curl -i -L "http://hadoopnamenode:50070/webhdfs/v1/tmp/test.txt?op=open&user.name=hadoopusername"

   Replace hadoopnamenode with the hostname or IP address of the name node in your Hadoop cluster, /tmp/test.txt with the path to the file in the Hadoop filesystem you located in step 1, and hadoopusername with the user name of a Hadoop user that has read access to the file.

   If successful, the command produces output similar to the following:

   HTTP/1.1 200 OK
   Server: Apache-Coyote/1.1
   Set-Cookie: hadoop.auth="u=hadoopuser&p=password&t=simple&e=1344383263490&s=n8yb/chfg56qnmrqrtqo0idrmve="; Version=1; Path=/
   Content-Type: application/octet-stream
   Content-Length: 16
   Date: Tue, 07 Aug 2012 13:47:44 GMT

   A|1|2|3
   B|4|5|6

   If you are using Kerberos authentication, run the following command from the Linux command line:

   curl --negotiate -i -L -u:anyuser http://hadoopnamenode:50070/webhdfs/v1/tmp/test.txt?op=open

   Replace hadoopnamenode with the hostname or IP address of the name node in your Hadoop cluster, and /tmp/test.txt with the path to the file in the Hadoop filesystem you located in step 1.

   If successful, the command produces output similar to the following:

   HTTP/1.1 401 Unauthorized
   Content-Type: text/html; charset=utf-8
   WWW-Authenticate: Negotiate
   Content-Length: 0
   Server: Jetty(6.1.26)

   HTTP/1.1 307 TEMPORARY_REDIRECT
   Content-Type: application/octet-stream
   Expires: Thu, 01-Jan-1970 00:00:00 GMT
   Set-Cookie: hadoop.auth="u=exampleuser&p=exampleuser@mycompany.com&t=kerberos&
   e=1375144834763&s=iy52irvjuuoz5iyg8g5g12o2vwo=";Path=/
   Location: http://hadoopnamenode.mycompany.com:1006/webhdfs/v1/user/release/docexample/test.txt?
   op=open&delegation=jaahcmvszwfzzqdyzwxlyxnlaiobqcrfpdgkaubo7cnrju3tbbslid_
   osb658jfgfRpEt8-u9WHymRJXRUJIREZTIGRlbGVnYXRpb24SMTAuMjAuMTAwLjkxOjUwMDcw&offset=0
   Content-Length: 0
   Server: Jetty(6.1.26)

   HTTP/1.1 200 OK
   Content-Type: application/octet-stream
   Content-Length: 16
   Server: Jetty(6.1.26)

   A|1|2|3
   B|4|5|6

If the curl command fails, you must review the error messages and resolve any issues before using the HP Vertica Connector for HDFS with your Hadoop cluster. Some debugging steps include:

Verify the HDFS service's port number.

Verify that the Hadoop user you specified exists and has read access to the file you are attempting to retrieve.

Installing the HP Vertica Connector for HDFS

The HP Vertica Connector for HDFS is not included as part of the HP Vertica Server installation. You must download it from my.vertica.com and install it on all nodes in your HP Vertica database.

The connector installation package contains several support libraries in addition to the library for the connector. Unlike some other packages supplied by HP, you need to install these packages on all of the hosts in your HP Vertica database so each host has a copy of the support libraries.

Downloading and Installing the HP Vertica Connector for HDFS Package

Follow these steps to install the HP Vertica Connector for HDFS:

1. Use a Web browser to log into the myVertica portal.

2. Click the Download tab.

3. Locate the section for the HP Vertica Connector for HDFS that you want, and download the installation package that matches the Linux distribution on your HP Vertica cluster.

4. Copy the installation package to each host in your database.

5. Log into each host as root and run the following command to install the connector package.

   On Red Hat-based Linux distributions, use the command:

   rpm -Uvh /path/installation-package.rpm

   For example, if you downloaded the Red Hat installation package to the dbadmin home directory, the command to install the package is:

   rpm -Uvh /home/dbadmin/vertica-hdfs-connectors-7.1.x.x86_64.rhel5.rpm

   On Debian-based systems, use the command:

   dpkg -i /path/installation-package.deb

Once you have installed the connector package on each host, you need to load the connector library into HP Vertica. See Loading the HDFS User Defined Source for instructions.

Loading the HDFS User Defined Source

Once you have installed the HP Vertica Connector for HDFS package on each host in your database, you need to load the connector's library into HP Vertica and define the User Defined Source (UDS) in the HP Vertica catalog. The UDS is the component you use to access data from HDFS. The connector install package contains a SQL script named install.sql that performs these steps for you. To run it:

1. Log into an HP Vertica host using the database administrator account.

2. Execute the installation script:

   vsql -f /opt/vertica/packages/hdfs_connectors/install.sql

3. Enter the database administrator password if prompted.

Note: You only need to run the installation script once in order to create the User Defined Source in the HP Vertica catalog. You do not need to run the install script on each node.

The SQL install script loads the HP Vertica Connector for HDFS library and defines the HDFS User Defined Source named HDFS. The output of successfully running the installation script looks like this:

                 version
-------------------------------------------
 Vertica Analytic Database v7.1.x
(1 row)

CREATE LIBRARY
CREATE SOURCE FUNCTION

Once the install script finishes running, the connector is ready to use.

Loading Data Using the HP Vertica Connector for HDFS

After you install the HP Vertica Connector for HDFS, you can use the HDFS User Defined Source (UDS) in a COPY statement to load data from HDFS files.

The syntax for using the HDFS UDS in a COPY statement is:

COPY tablename SOURCE Hdfs(url='WebHDFSFileURL', [username='username'], [low_speed_limit=speed]);

tablename
    The name of the table to receive the copied data.

WebHDFSFileURL
    A string containing one or more URLs that identify the file or files to be read. See below for details. Use commas to separate multiple URLs.

    Important: You must replace any commas in the URLs with the escape sequence %2c. For example, if you are loading a file named doe,john.txt, change the file's name in the URL to doe%2cjohn.txt.

username
    The username of a Hadoop user that has permissions to access the files you want to copy. If you are using Kerberos, omit this argument.

speed
    The minimum data transmission rate, expressed in bytes per second, that the connector allows. The connector breaks any connection between the Hadoop and HP Vertica clusters that transmits data slower than this rate for more than 1 minute. After the connector breaks a connection for being too slow, it attempts to connect to another node in the Hadoop cluster. This new connection can supply the data that the broken connection was retrieving. The connector terminates the COPY statement and returns an error message if:

    It cannot find another Hadoop node to supply the data.

    The previous transfer attempts from all other Hadoop nodes that have the file also closed because they were too slow.

    Default value: 1048576 (1 MB per second transmission rate).

The HDFS File URL

The url parameter in the Hdfs function call is a string containing one or more comma-separated HTTP URLs. These URLs identify the files in HDFS that you want to load. The format for each URL in this string is:

http://NameNode:Port/webhdfs/v1/HDFSFilePath

NameNode
    The host name or IP address of the Hadoop cluster's name node.

Port
    The port number on which the WebHDFS service is running. This number is usually 50070 or 14000, but may be different in your Hadoop installation.

webhdfs/v1/
    The protocol being used to retrieve the file. This portion of the URL is always the same. It tells Hadoop to use version 1 of the WebHDFS API.

HDFSFilePath
    The path from the root of the HDFS filesystem to the file or files you want to load. This path can contain standard Linux wildcards.

    Important: Any wildcards you use to specify multiple input files must resolve to files only. They must not include any directories. For example, if you specify the path /user/hadoopuser/output/*, and the output directory contains a subdirectory, the connector returns an error message.

The following example shows how to use the HP Vertica Connector for HDFS to load a single file named /tmp/test.txt. The Hadoop cluster's name node is named hadoop.

=> COPY testtable SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/tmp/test.txt',
->                            username='hadoopuser');
 Rows Loaded
-------------
           2

(1 row)

Copying Files in Parallel

The basic COPY statement in the previous example copies a single file. It runs on just a single host in the HP Vertica cluster, because the connector cannot break up the workload among nodes. Any data load that does not take advantage of all nodes in the HP Vertica cluster is inefficient.

To make loading data from HDFS more efficient, spread the data across multiple files on HDFS. This approach is often natural for data you want to load from HDFS; Hadoop MapReduce jobs usually store their output in multiple files.

You specify multiple files to be loaded in your Hdfs function call by:

Using wildcards in the URL

Supplying multiple comma-separated URLs in the url parameter of the Hdfs user-defined source function call

Supplying multiple comma-separated URLs that contain wildcards

Loading multiple files through the HP Vertica Connector for HDFS results in an efficient load. The HP Vertica hosts connect directly to individual nodes in the Hadoop cluster to retrieve files. If Hadoop has broken files into multiple chunks, the HP Vertica hosts directly connect to the nodes storing each chunk.

The following example shows how to load all of the files whose filenames start with "part-" located in the /user/hadoopuser/output directory on the HDFS. If there are at least as many files in this directory as there are nodes in the HP Vertica cluster, all nodes in the cluster load data from the HDFS.

=> COPY Customers SOURCE
-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/hadoopuser/output/part-*',
->      username='hadoopuser');
 Rows Loaded
-------------
       40008
(1 row)

To load data from multiple directories on HDFS at once, use multiple comma-separated URLs in the URL string:

=> COPY Customers SOURCE
-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/hadoopuser/output/part-*,
->      http://hadoop:50070/webhdfs/v1/user/anotheruser/part-*',
->      username='hadoopuser');
 Rows Loaded
-------------
       80016
(1 row)
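If your clusters use Kerberos authentication, omit the username parameter, as noted in the parameter descriptions. The statement below is a sketch only; it reuses the hypothetical name node (hadoop) and file from the single-file example, and assumes the session is already authenticated through Kerberos as described in Installing and Configuring Kerberos on Your HP Vertica Cluster:

=> COPY testtable SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/tmp/test.txt');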

Note: HP Vertica statements must be less than 65,000 characters long. If you supply too many long URLs in a single statement, you could go over this limit. Normally, you would only approach this limit if you are automatically generating the COPY statement using a program or script.

Viewing Rejected Rows and Exceptions

COPY statements that use the HP Vertica Connector for HDFS use the same method for recording rejections and exceptions as other COPY statements. Rejected rows and exceptions are saved to log files. These log files are stored by default in the CopyErrorLogs subdirectory in the database's catalog directory.

Due to the distributed nature of the HP Vertica Connector for HDFS, you cannot use the ON option to force all exception and rejected row information to be written to log files on a single HP Vertica host. Instead, you need to collect the log files from across the hosts to review all of the exceptions and rejections generated by the COPY statement.

For more about handling rejected rows, see Capturing Load Rejections and Exceptions.

Creating an External Table Based on HDFS Files

You can use the HP Vertica Connector for HDFS as a source for an external table that lets you directly perform queries on the contents of files on the Hadoop Distributed File System (HDFS). See Using External Tables in the Administrator's Guide for more information on external tables.

Using an external table to access data stored on HDFS is useful when you need to extract data from files that are periodically updated, or that have additional files added on the HDFS. It saves you from having to drop previously loaded data and then reload the data using a COPY statement. The external table always accesses the current version of the files on the HDFS.

Note: An external table performs a bulk load each time it is queried. Its performance is significantly slower than querying an internal HP Vertica table. You should only use external tables for infrequently-run queries (such as daily reports). If you need to frequently query the content of the HDFS files, you should either use COPY to load the entire content of the files into HP Vertica, or save the results of a query run on an external table to an internal table which you then use for repeated queries.

To create an external table that reads data from HDFS, you use the Hdfs User Defined Source (UDS) in a CREATE EXTERNAL TABLE AS COPY statement. The COPY portion of this statement has the same format as the COPY statement used to load data from HDFS. See Loading Data Using the HP Vertica Connector for HDFS for more information.

The following simple example shows how to create an external table that extracts data from every file in the /user/hadoopuser/example/output directory using the HP Vertica Connector for HDFS.

=> CREATE EXTERNAL TABLE hadoopexample (A VARCHAR(10), B INTEGER, C INTEGER, D INTEGER)
-> AS COPY SOURCE Hdfs(url=
-> 'http://hadoop01:50070/webhdfs/v1/user/hadoopuser/example/output/*',

-> username='hadoopuser');
CREATE TABLE

=> SELECT * FROM hadoopexample;
   A    | B | C | D
--------+---+---+---
 test1  | 1 | 2 | 3
 test1  | 3 | 4 | 5
(2 rows)

Later, after another Hadoop job adds contents to the output directory, querying the table produces different results:

=> SELECT * FROM hadoopexample;
   A    | B  | C  | D
--------+----+----+----
 test3  | 10 | 11 | 12
 test3  | 13 | 14 | 15
 test2  |  6 |  7 |  8
 test2  |  9 |  0 | 10
 test1  |  1 |  2 |  3
 test1  |  3 |  4 |  5
(6 rows)

Load Errors in External Tables

Normally, querying an external table on HDFS does not produce any errors, even if rows are rejected by the underlying COPY statement (for example, rows containing columns whose contents are incompatible with the data types in the table). Rejected rows are handled the same way they are in a standard COPY statement: they are written to a rejected data file, and are noted in the exceptions file. For more information on how COPY handles rejected rows and exceptions, see Capturing Load Rejections and Exceptions in the Administrator's Guide.

Rejection and exception files are created on all of the nodes that load data from the HDFS. You cannot specify a single node to receive all of the rejected row and exception information. These files are created on each HP Vertica node as it processes files loaded through the HP Vertica Connector for HDFS.

Note: Since the connector is read-only, there is no way to store rejection and exception information on the HDFS.

Fatal errors during the transfer of data (for example, specifying files that do not exist on the HDFS) do not occur until you query the external table. The following example shows what happens if you recreate the table based on a file that does not exist on HDFS.

=> DROP TABLE hadoopexample;
DROP TABLE
=> CREATE EXTERNAL TABLE hadoopexample (A INTEGER, B INTEGER, C INTEGER, D INTEGER)
-> AS COPY SOURCE HDFS(url='http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt',
-> username='hadoopuser');

CREATE TABLE
=> SELECT * FROM hadoopexample;
ERROR 0:  Error calling plan() in User Function HdfsFactory at [src/hdfs.cpp:222], error code: 0, message: No files match [http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt]

Note that it is not until you actually query the table that the connector attempts to read the file. Only then does it return an error.

Installing and Configuring Kerberos on Your HP Vertica Cluster

You must perform several steps to configure your HP Vertica cluster before the HP Vertica Connector for HDFS can authenticate an HP Vertica user using Kerberos.

Note: You only need to perform these configuration steps if you are using the connector with Kerberos. In a non-Kerberos environment, the connector does not require your HP Vertica cluster to have any special configuration.

Perform the following steps on each node in your HP Vertica cluster:

1. Install the Kerberos libraries and client. To learn how to install these packages, see the documentation for your Linux distribution. On some Red Hat and CentOS versions, you can install these packages by executing the following command as root:

   yum install krb5-libs krb5-workstation

   On some versions of Debian, you would use the command:

   apt-get install krb5-config krb5-user krb5-clients

2. Update the Kerberos configuration file (/etc/krb5.conf) to reflect your site's Kerberos configuration. The easiest method of doing this is to copy the /etc/krb5.conf file from your Kerberos Key Distribution Center (KDC) server.

3. Copy the keytab files for the users to a directory on the node (see Preparing Keytab Files for the HP Vertica Connector for HDFS). The absolute path to these files must be the same on every node in your HP Vertica cluster.

4. Ensure the keytab files are readable by the database administrator's Linux account (usually dbadmin). The easiest way to do this is to change ownership of the files to dbadmin:

   sudo chown dbadmin *.keytab

Preparing Keytab Files for the HP Vertica Connector for HDFS

The HP Vertica Connector for HDFS uses keytab files to authenticate HP Vertica users with Kerberos, so they can access files on the Hadoop HDFS filesystem. These files take the place of entering a password at a Kerberos login prompt.

You must have a keytab file for each HP Vertica user that needs to use the connector. The keytab file must match the Kerberos credentials of a Hadoop user that has access to the HDFS.

Caution: Exporting a keytab file for a user changes the version number associated with the user's Kerberos account. This change invalidates any previously exported keytab file for the user. If a keytab file has already been created for a user and is currently in use, you should use that keytab file rather than exporting a new keytab file. Otherwise, exporting a new keytab file will cause any processes using an existing keytab file to no longer be able to authenticate.

To export a keytab file for a user:

1. Start the kadmin utility:

   If you have access to the root account on the Kerberos Key Distribution Center (KDC) system, log into it and use the kadmin.local command. (If this command is not in the system search path, try /usr/kerberos/sbin/kadmin.local.)

   If you do not have access to the root account on the Kerberos KDC, you can use the kadmin command from a Kerberos client system as long as you have the password for the Kerberos administrator account. When you start kadmin, it prompts you for the Kerberos administrator's password. You may need to have root privileges on the client system in order to run kadmin.

2. Use kadmin's xst (export) command to export the user's credentials to a keytab file:

   xst -k username.keytab username@YOURDOMAIN.COM

   where username is the name of the Kerberos principal you want to export, and YOURDOMAIN.COM is your site's Kerberos realm. This command creates a keytab file named username.keytab in the current directory.

The following example demonstrates exporting a keytab file for a user named exampleuser@mycompany.com using the kadmin command on a Kerberos client system:

$ sudo kadmin
[sudo] password for dbadmin:
Authenticating as principal root/admin@mycompany.com with password.
Password for root/admin@mycompany.com:
kadmin:  xst -k exampleuser.keytab exampleuser@mycompany.com
Entry for principal exampleuser@mycompany.com with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:exampleuser.keytab.
Entry for principal exampleuser@mycompany.com with kvno 2, encryption type des-cbc-crc added to keytab WRFILE:exampleuser.keytab.

After exporting the keytab file, you can use the klist command to list the keys stored in the file:

$ sudo klist -k exampleuser.keytab
[sudo] password for dbadmin:
Keytab name: FILE:exampleuser.keytab

KVNO Principal
---- -----------------------------------------------------------------
   2 exampleuser@mycompany.com
   2 exampleuser@mycompany.com

HP Vertica Connector for HDFS Troubleshooting Tips

The following sections explain some of the common issues you may encounter when using the HP Vertica Connector for HDFS.

User Unable to Connect to Kerberos-Authenticated Hadoop Cluster

A user may suddenly be unable to connect to Hadoop through the connector in a Kerberos-enabled environment. This issue can be caused by someone exporting a new keytab file for the user, which invalidates existing keytab files. You can determine whether invalid keytab files are the problem by comparing the key version number associated with the user's principal key in Kerberos with the key version number stored in the keytab file on the HP Vertica cluster.

To find the key version number for a user in Kerberos:

1. From the Linux command line, start the kadmin utility (kadmin.local if you are logged into the Kerberos Key Distribution Center). Run the getprinc command for the user:

   $ sudo kadmin
   [sudo] password for dbadmin:
   Authenticating as principal root/admin@mycompany.com with password.
   Password for root/admin@mycompany.com:
   kadmin:  getprinc exampleuser@mycompany.com
   Principal: exampleuser@mycompany.com
   Expiration date: [never]
   Last password change: Fri Jul 26 09:40:44 EDT 2013
   Password expiration date: [none]
   Maximum ticket life: 1 day 00:00:00
   Maximum renewable life: 0 days 00:00:00
   Last modified: Fri Jul 26 09:40:44 EDT 2013 (root/admin@mycompany.com)
   Last successful authentication: [never]
   Last failed authentication: [never]
   Failed password attempts: 0
   Number of keys: 2
   Key: vno 3, des3-cbc-sha1, no salt
   Key: vno 3, des-cbc-crc, no salt
   MKey: vno 0
   Attributes:
   Policy: [none]

Using the HP Vertica Connector for HDFS In the preceding example, there are two keys stored for the user, both of which are at version number (vno) 3. 2. To get the version numbers of the keys stored in the keytab file, use the klist command: $ sudo klist -ek exampleuser.keytab Keytab name: FILE:exampleuser.keytab KVNO Principal ---- ---------------------------------------------------------------------- 2 exampleuser@mycompany.com (des3-cbc-sha1) 2 exampleuser@mycompany.com (des-cbc-crc) 3 exampleuser@mycompany.com (des3-cbc-sha1) 3 exampleuser@mycompany.com (des-cbc-crc) The first column in the output lists the key version number. In the preceding example, the keytab includes both key versions 2 and 3, so the keytab file can be used to authenticate the user with Kerberos. Resolving Error 5118 When using the connector, you may receive an error message similar to the following: ERROR 5118: UDL specified no execution nodes; at least one execution node must be specified To correct this error, verify that all of the nodes in your HP Vertica cluster have the correct version of the HP Vertica Connector for HDFS package installed. This error can occur if one or more of the nodes do not have the supporting libraries installed. These libraries may be missing because one of the nodes was skipped when initially installing the connector package. Another possibility is that one or more nodes have been added since the connector was installed. Transfer Rate Errors The HP Vertica Connector for HDFS monitors how quickly Hadoop sends data to HP Vertica.In some cases, the data transfer speed on any connection between a node in your Hadoop cluster and a node in your HP Vertica cluster slows beyond a lower limit (by default, 1 MB per second). When the transfer slows beyond the lower limit, the connector breaks the data transfer. It then connects to another node in the Hadoop cluster that contains the data it was retrieving. If it cannot find another node in the Hadoop cluster to supply the data (or has already tried all of the nodes in the Hadoop cluster), the connector terminates the COPY statement and returns an error. => COPY messages SOURCE Hdfs (url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt', username='exampleuser'); ERROR 3399: Failure in UDx RPC call InvokeProcessUDL(): Error calling processudl() in User Defined Object [Hdfs] at [src/hdfs.cpp:275], error code: 0, message: [Transferring rate during last 60 seconds is 172655 byte/s, below threshold 1048576 byte/s, give up. HP Vertica Analytic Database (7.1.x) Page 59 of 123

Using the HP Vertica Connector for HDFS The last error message: Operation too slow. Less than 1048576 bytes/sec transferred the last 1 seconds. The URL: http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt?op=open&offset=154901544&length=1 13533912. The redirected URL: http://hadoop.example.com:50075/webhdfs/v1/tmp/data.txt?op=open& namenoderpcaddress=hadoop.example.com:8020&length=113533912&offset=154901544.] If you encounter this error, troubleshoot the connection between your HP Vertica and Hadoop clusters. If there are no problems with the network, determine if either your Hadoop cluster or HP Vertica cluster is overloaded. If the nodes in either cluster are too busy, they may not be able to maintain the minimum data transfer rate. If you cannot resolve the issue causing the slow transfer rate, you can lower the minimum acceptable speed. To do so, set the low_speed_limit parameter for the Hdfs source. The following example shows how to set low_speed_limit to 524288 to accept transfer rates as low as 512 KB per second (half the default lower limit). => COPY messages SOURCE Hdfs (url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt', username='exampleuser', low_speed_limit=524288); Rows Loaded ------------- 9891287 (1 row) When you lower the low_speed_limit parameter, the COPY statement loading data from HDFS may take a long time to complete. You can also increase the low_speed_limit setting if the network between your Hadoop cluster and HP Vertica cluster is fast. You can choose to increase the lower limit to force COPY statements to generate an error, if they are running more slowly than they normally should, given the speed of the network. HP Vertica Analytic Database (7.1.x) Page 60 of 123
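For example, the following statement raises the limit to 2 MB per second; the value is purely illustrative and should match what your network can actually sustain:

=> COPY messages SOURCE Hdfs (url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt',
        username='exampleuser', low_speed_limit=2097152);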

Using the HCatalog Connector

The HP Vertica HCatalog Connector lets you access data stored in Apache's Hive data warehouse software the same way you access it within a native HP Vertica table. If your files are in the Optimized Row Columnar (ORC) format, you might be able to read them directly instead of going through this connector. For more information, see Reading ORC Files Directly.

Hive, HCatalog, and WebHCat Overview

There are several Hadoop components that you need to understand in order to use the HCatalog Connector:

- Apache's Hive lets you query data stored in a Hadoop Distributed File System (HDFS) the same way you query data stored in a relational database. Behind the scenes, Hive uses a set of serializer and deserializer (SerDe) classes to extract data from files stored on the HDFS and break it into columns and rows. Each SerDe handles data files in a specific format. For example, one SerDe extracts data from comma-separated data files while another interprets data stored in JSON format.
- Apache HCatalog is a component of the Hadoop ecosystem that makes Hive's metadata available to other Hadoop components (such as Pig).
- WebHCat (formerly known as Templeton) makes HCatalog and Hive data available via a REST web API. Through it, you can make an HTTP request to retrieve data stored in Hive, as well as information about the Hive schema.

HP Vertica's HCatalog Connector lets you transparently access data that is available through WebHCat. You use the connector to define a schema in HP Vertica that corresponds to a Hive database or schema. When you query data within this schema, the HCatalog Connector transparently extracts and formats the data from Hadoop into tabular data. The data within this HCatalog schema appears as if it is native to HP Vertica. You can even perform operations such as joins between HP Vertica-native tables and HCatalog tables. For more details, see How the HCatalog Connector Works.

HCatalog Connection Features

The HCatalog Connector lets you query data stored in Hive using the HP Vertica native SQL syntax. Some of its main features are:

- The HCatalog Connector always reflects the current state of data stored in Hive.
- The HCatalog Connector uses the parallel nature of both HP Vertica and Hadoop to process Hive data. The result is that querying data through the HCatalog Connector is often faster than querying the data directly through Hive.
- Since HP Vertica performs the extraction and parsing of data, the HCatalog Connector does not significantly increase the load on your Hadoop cluster.
- The data you query through the HCatalog Connector can be used as if it were native HP Vertica data. For example, you can execute a query that joins data from a table in an HCatalog schema with a native table.

HCatalog Connection Considerations

There are a few things to keep in mind when using the HCatalog Connector:

- Hive's data is stored in flat files in a distributed filesystem, requiring it to be read and deserialized each time it is queried. This deserialization causes Hive's performance to be much slower than HP Vertica. The HCatalog Connector has to perform the same process as Hive to read the data. Therefore, querying data stored in Hive using the HCatalog Connector is much slower than querying a native HP Vertica table. If you need to perform extensive analysis on data stored in Hive, you should consider loading it into HP Vertica through the HCatalog Connector or the WebHDFS connector. HP Vertica optimization often makes querying data through the HCatalog Connector faster than directly querying it through Hive.
- Hive supports complex data types such as lists, maps, and structs that HP Vertica does not support. Columns containing these data types are converted to a JSON representation of the data type and stored as a VARCHAR. See Data Type Conversions from Hive to HP Vertica.

Note: The HCatalog Connector is read only. It cannot insert data into Hive.

Reading ORC Files Directly

If your HDFS data is in the Optimized Row Columnar (ORC) format and uses no complex data types, then instead of using the HCatalog Connector you can use the ORC Reader to access the data directly. Reading directly may provide better performance.

The decisions you make when writing ORC files can affect performance when using them. To get the best performance from the ORC Reader, do the following when writing (a sketch of a Hive table definition that applies these settings follows the list):

- Use the latest available Hive version to write ORC files. (You can still read them with earlier versions.)
- Use a large stripe size; 256 MB or greater is preferred.
- Partition the data at the table level.
- Sort the columns based on frequency of access, most-frequent first.
- Use Snappy or ZLib compression.
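The following Hive DDL is a minimal sketch of how you might apply these recommendations when writing ORC files. The table, columns, and property values are hypothetical, and the available ORC table properties depend on your Hive version:

CREATE TABLE weblogs_orc (
    user_id  STRING,     -- columns listed most-frequently-accessed first
    url      STRING,
    referrer STRING
)
PARTITIONED BY (log_date STRING)         -- partition at the table level
STORED AS ORC
TBLPROPERTIES (
    'orc.compress'    = 'SNAPPY',        -- or 'ZLIB'
    'orc.stripe.size' = '268435456'      -- 256 MB stripes
);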

Syntax

In the COPY statement, specify a format of ORC as follows:

COPY tablename FROM path ORC;

In the CREATE EXTERNAL TABLE AS COPY statement, specify a format of ORC as follows:

CREATE EXTERNAL TABLE tablename (columns) AS COPY FROM path ORC;

If the file resides on the local file system of the node where you are issuing the command, use a local file path for path. If the file resides elsewhere in HDFS, use the webhdfs:// prefix and then specify the host name, port, and file path. Use ON ANY NODE for files that are not local to improve performance.

COPY t FROM 'webhdfs://somehost:port/opt/data/orcfile' ON ANY NODE ORC;

The ORC reader supports ZLib and Snappy compression.

The CREATE EXTERNAL TABLE AS COPY statement must consume all of the columns in the ORC file; unlike with some other data sources, you cannot select only the columns of interest. If you omit columns, the ORC reader aborts with an error and does not copy any data.

If you load from multiple ORC files in the same COPY statement and any of them is aborted, the entire load is aborted. This is different behavior than for delimited files, where the COPY statement loads what it can and ignores the rest.

Supported Data Types

The HP Vertica ORC file reader can natively read columns of all data types supported in HIVE version 0.11 and later except for complex types. If complex types such as maps are encountered, the COPY or CREATE EXTERNAL TABLE AS COPY statement aborts with an error message. The ORC reader does not attempt to read only some columns; either the entire file is read or the operation fails. For a complete list of supported types, see HIVE Data Types.

Kerberos

If the ORC file is located on an HDFS cluster that uses Kerberos authentication, HP Vertica uses the current user's principal to authenticate. It does not use the database's principal.

Query Performance

When working with external tables in ORC format, HP Vertica tries to improve performance in two ways: by pushing query execution closer to the data so less has to be read and transmitted, and by taking advantage of data locality in planning the query.

Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network. ORC files have three levels of indexing: file statistics, stripe statistics, and row group indexes. Predicates are applied only to the first two levels.

Predicate pushdown works and is automatically applied for ORC files written with HIVE version 0.14 and later. ORC files written with earlier versions of HIVE might not contain the required statistics. When executing a query against an ORC file that lacks these statistics, HP Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS system table. If you are seeing performance problems with your queries, check this table for these events.

In a cluster where HP Vertica nodes are co-located on HDFS nodes, the query can also take advantage of data locality. If data is on an HDFS node where a database node is also present, and if the query is not restricted to specific nodes using ON NODE, then the query planner uses that database node to read that data. This allows HP Vertica to read data locally instead of making a network call.

You can see how much ORC data is being read locally by inspecting the query plan. The label for LoadStep(s) in the plan contains a statement of the form: "X% of ORC data matched with colocated Vertica nodes". To increase the volume of local reads, consider adding more database nodes. HDFS data, by its nature, can't be moved to specific nodes, but if you run more database nodes you increase the likelihood that a database node is local to one of the copies of the data.

Examples

The following example shows how to read from all ORC files in a directory. It uses all supported data types.

CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT,
   a6 DOUBLE PRECISION, a7 BOOLEAN, a8 DATE, a9 TIMESTAMP, a10 VARCHAR(20),
   a11 VARCHAR(20), a12 CHAR(20), a13 BINARY(20), a14 DECIMAL(10,5))
   AS COPY FROM '/data/orc_test_*.orc' ORC;

The following example shows the error that is produced if the file you specify is not recognized as an ORC file:

CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT)
   AS COPY FROM '/data/not_an_orc_file.orc' ORC;
ERROR 0: Failed to read orc source [/data/not_an_orc_file.orc]: Not an ORC file

How the HCatalog Connector Works

When planning a query that accesses data from a Hive table, the HP Vertica HCatalog Connector on the initiator node contacts the WebHCat server in your Hadoop cluster to determine if the table exists. If it does, the connector retrieves the table's metadata from the metastore database so the query planning can continue. When the query executes, all nodes in the HP Vertica cluster directly retrieve the data necessary for completing the query from the Hadoop HDFS. They then use the Hive SerDe classes to extract the data so the query can execute.

This approach takes advantage of the parallel nature of both HP Vertica and Hadoop. In addition, by performing the retrieval and extraction of data directly, the HCatalog Connector reduces the impact of the query on the Hadoop cluster.

HCatalog Connector Requirements

Before you can use the HCatalog Connector, both your HP Vertica and Hadoop installations must meet the following requirements.

HP Vertica Requirements

- All of the nodes in your cluster must have a Java Virtual Machine (JVM) installed. See Installing the Java Runtime on Your HP Vertica Cluster.
- You must also add certain libraries distributed with Hadoop and Hive to your HP Vertica installation directory. See Configuring HP Vertica for HCatalog.

Hadoop Requirements

Your Hadoop cluster must meet several requirements to operate correctly with the HP Vertica Connector for HCatalog:

- It must have Hive and HCatalog installed and running. See Apache's HCatalog page for more information.
- It must have WebHCat (formerly known as Templeton) installed and running. See Apache's WebHCat page for details.
- The WebHCat server and all of the HDFS nodes that store HCatalog data must be directly accessible from all of the hosts in your HP Vertica database. Verify that any firewall separating the Hadoop cluster and the HP Vertica cluster will pass WebHCat, metastore database, and HDFS traffic.
- The data that you want to query must be in an internal or external Hive table.
- If a table you want to query uses a non-standard SerDe, you must install the SerDe's classes on your HP Vertica cluster before you can query the data. See Using Non-Standard SerDes.

Testing Connectivity

To test the connection between your database cluster and WebHCat, log into a node in your HP Vertica cluster. Then, run the following command to execute an HCatalog query:

$ curl http://webhcatserver:port/templeton/v1/status?user.name=hcatusername

Where:

- webhcatserver is the IP address or hostname of the WebHCat server
- port is the port number assigned to the WebHCat service (usually 50111)
- hcatusername is a valid username authorized to use HCatalog

Usually, you want to append ;echo to the command to add a linefeed after the curl command's output. Otherwise, the command prompt is automatically appended to the command's output, making it harder to read. For example:

$ curl http://hcathost:50111/templeton/v1/status?user.name=hive; echo

If there are no errors, this command returns a status message in JSON format, similar to the following:

{"status":"ok","version":"v1"}

This result indicates that WebHCat is running and that the HP Vertica host can connect to it and retrieve a result. If you do not receive this result, troubleshoot your Hadoop installation and the connectivity between your Hadoop and HP Vertica clusters. For details, see Troubleshooting HCatalog Connector Problems.

You can also run some queries to verify that WebHCat is correctly configured to work with Hive. The following example demonstrates listing the databases defined in Hive and the tables defined within a database:

$ curl http://hcathost:50111/templeton/v1/ddl/database?user.name=hive; echo
{"databases":["default","production"]}
$ curl http://hcathost:50111/templeton/v1/ddl/database/default/table?user.name=hive; echo
{"tables":["messages","weblogs","tweets","transactions"],"database":"default"}

See Apache's WebHCat reference for details about querying Hive using WebHCat.

Installing the Java Runtime on Your HP Vertica Cluster

The HCatalog Connector requires a 64-bit Java Virtual Machine (JVM). The JVM must support Java 6 or later, and must be the same version as the one installed on your Hadoop nodes.

Note: If your HP Vertica cluster is configured to execute User Defined Extensions (UDxs) written in Java, it already has a correctly-configured JVM installed. See Developing User Defined Functions in Java in the Extending HP Vertica Guide for more information.

Installing Java on your HP Vertica cluster is a two-step process:

1. Install a Java runtime on all of the hosts in your cluster.
2. Set the JavaBinaryForUDx configuration parameter to tell HP Vertica the location of the Java executable.

Installing a Java Runtime

For Java-based features, HP Vertica requires a 64-bit Java 6 (Java version 1.6) or later Java runtime. HP Vertica supports runtimes from either Oracle or OpenJDK. You can choose to install either the Java Runtime Environment (JRE) or Java Development Kit (JDK), since the JDK also includes the JRE.

Many Linux distributions include a package for the OpenJDK runtime. See your Linux distribution's documentation for information about installing and configuring OpenJDK.

To install the Oracle Java runtime, see the Java Standard Edition (SE) Download Page. You usually run the installation package as root in order to install it. See the download page for instructions.

Once you have installed a JVM on each host, ensure that the java command is in the search path and calls the correct JVM by running the command:

$ java -version

This command should print something similar to:

java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)

Note: Any previously installed Java VM on your hosts may interfere with a newly installed Java runtime. See your Linux distribution's documentation for instructions on configuring which JVM is the default. Unless absolutely required, you should uninstall any incompatible version of Java before installing the Java 6 or Java 7 runtime.

Setting the JavaBinaryForUDx Configuration Parameter

The JavaBinaryForUDx configuration parameter tells HP Vertica where to look for the JRE to execute Java UDxs. After you have installed the JRE on all of the nodes in your cluster, set this parameter to the absolute path of the Java executable. You can use the symbolic link that some Java installers create (for example /usr/bin/java). If the Java executable is in your shell search path, you can get the path of the Java executable by running the following command from the Linux command line shell:

$ which java
/usr/bin/java

If the java command is not in the shell search path, use the path to the Java executable in the directory where you installed the JRE. Suppose you installed the JRE in /usr/java/default (which is where the installation package supplied by Oracle installs the Java 1.6 JRE). In this case the Java executable is /usr/java/default/bin/java.

You set the configuration parameter by executing the following statement as a database superuser:

=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';

See ALTER DATABASE for more information on setting configuration parameters.

To view the current setting of the configuration parameter, query the CONFIGURATION_PARAMETERS system table:

=> \x
Expanded display is on.
=> SELECT * FROM CONFIGURATION_PARAMETERS WHERE parameter_name = 'JavaBinaryForUDx';
-[ RECORD 1 ]-----------------+----------------------------------------------------------
node_name                     | ALL
parameter_name                | JavaBinaryForUDx
current_value                 | /usr/bin/java
default_value                 |
change_under_support_guidance | f
change_requires_restart       | f
description                   | Path to the java binary for executing UDx written in Java

Once you have set the configuration parameter, HP Vertica can find the Java executable on each node in your cluster.

Note: Since the location of the Java executable is set by a single configuration parameter for the entire cluster, you must ensure that the Java executable is installed in the same path on all of the hosts in the cluster.

Configuring HP Vertica for HCatalog

Before you can use the HCatalog Connector, you must add certain Hadoop and Hive libraries to your HP Vertica installation. You must also copy the Hadoop configuration files that specify various connection properties. HP Vertica uses the values in those configuration files to make its own connections to Hadoop. You need only make these changes on one node in your cluster. After you do this you can install the HCatalog connector.

Copy Hadoop JARs and Configuration Files

HP Vertica provides a tool, hcatutil, to collect the required files from Hadoop. This tool was introduced in version 7.1.1. (In a previous version you were required to copy these files manually.) This tool copies selected JARs and XML configuration files.

Note: If you plan to use HIVE to query files that use Snappy compression, you also need access to the Snappy native libraries. Either include the path(s) for libhadoop*.so and libsnappy*.so in the path you specify for --hcatlibpath, or copy these files to a directory on that path before beginning.

In order to use this tool you need access to the Hadoop files, which are found on nodes in the Hadoop cluster. If HP Vertica is not co-located on a Hadoop node, you should do the following:

1. Copy /opt/vertica/packages/hcat/tools/hcatutil to a Hadoop node and run it there, specifying a temporary output directory. Your Hadoop, HIVE, and HCatalog lib paths might be different; in particular, in newer versions of Hadoop the HCatalog directory is usually a subdirectory under the HIVE directory. Use the values from your environment in the following command:

   hcatutil --copyjars
      --hadoophivehome="/hadoop/lib;/hive/lib;/hcatalog/dist/share"
      --hadoophiveconfpath="/hadoop;/hive;/webhcat"
      --hcatlibpath=/tmp/hadoop-files

2. Verify that all necessary files were copied:

   hcatutil --verifyjars --hcatlibpath=/tmp/hadoop-files

3. Copy that output directory (/tmp/hadoop-files, in this example) to /opt/vertica/packages/hcat/lib on the HP Vertica node you will connect to when installing the HCatalog connector (a copy-command sketch appears after the argument list below). If you are updating an HP Vertica cluster to use a new Hadoop cluster (or a new version of Hadoop), first remove all JAR files in /opt/vertica/packages/hcat/lib except vertica-hcatalogudl.jar.

4. Verify that all necessary files were copied:

   hcatutil --verifyjars --hcatlibpath=/opt/vertica/packages/hcat

If you are using the HP Vertica for SQL on Hadoop product with co-located clusters, you can do this in one step on a shared node. Your Hadoop, HIVE, and HCatalog lib paths might be different; use the values from your environment in the following command:

   hcatutil --copyjars
      --hadoophivehome="/hadoop/lib;/hive/lib;/hcatalog/dist/share"
      --hadoophiveconfpath="/hadoop;/hive;/webhcat"
      --hcatlibpath=/opt/vertica/packages/hcat/lib

The hcatutil script has the following arguments:

-c, --copyjars
   Copy the required JARs from hadoophivepath to hcatlibpath.

-v, --verifyjars
   Verify that the required JARs are present in hcatlibpath.

--hadoophivehome="value1;value2;..."
   Paths to the Hadoop, Hive, and HCatalog home directories. Separate multiple paths by a semicolon (;). Enclose paths in double quotes. In newer versions of Hadoop, look for the HCatalog directory under the HIVE directory (for example, /hive/hcatalog/share).

--hcatlibpath="value1;value2;..."
   Output path of the lib/ folder of the HCatalog dependency JARs. Usually this is /opt/vertica/packages/hcat. You may use any folder, but make sure to copy all JARs to the hcat/lib folder before installing the HCatalog connector. If you have previously run hcatutil with a different version of Hadoop, remove the old JAR files first (all except vertica-hcatalogudl.jar).

--hadoophiveconfpath="value"
   Paths of the Hadoop, HIVE, and other components' configuration files (such as core-site.xml, hive-site.xml, and webhcat-site.xml). Separate multiple paths by a semicolon (;). Enclose paths in double quotes. These files contain values that would otherwise have to be specified to CREATE HCATALOG SCHEMA. If you are using Cloudera, or if your HDFS cluster uses Kerberos authentication, this parameter is required. Otherwise this parameter is optional.

Once you have copied the files and verified them, install the HCatalog connector.
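As an illustration of step 3 above, copying the collected files from the Hadoop node to the HP Vertica node might look like the following; the host name is hypothetical, and the command assumes SSH access between the two nodes:

$ scp -r /tmp/hadoop-files/* vertica01.example.com:/opt/vertica/packages/hcat/lib/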

Install the HCatalog Connector

On the same node where you copied the files from hcatutil, install the HCatalog connector by running the install.sql script. This script resides in the ddl/ folder under your HCatalog connector installation path. This script creates the library and the VHCatSource and VHCatParser functions.

Note: The data that was copied using hcatutil is now stored in the database. If you change any of those values in Hadoop, you need to rerun hcatutil and install.sql.

The following statement returns the names of the libraries and configuration files currently being used:

=> SELECT dependencies FROM user_libraries WHERE lib_name='vhcataloglib';

Now you can create HCatalog schema parameters, which point to your existing Hadoop/Hive/WebHCat services, as described in Defining a Schema Using the HCatalog Connector.

Using the HCatalog Connector with HA NameNode

Newer distributions of Hadoop support the High Availability NameNode (HA NN) for HDFS access. Some additional configuration is required to use this feature with the HCatalog Connector. If you do not perform this configuration, attempts to retrieve data through the connector will produce an error.

To use HA NN with HP Vertica, first copy /etc/hadoop/conf from the HDFS cluster to every node in your HP Vertica cluster. You can put this directory anywhere, but it must be in the same location on every node. (In the example below it is in /opt/hcat/hadoop_conf.) Then uninstall the HCat library, configure the UDx to use that configuration directory, and reinstall the library:

=> \i /opt/vertica/packages/hcat/ddl/uninstall.sql
DROP LIBRARY
=> ALTER DATABASE mydb SET JavaClassPathSuffixForUDx = '/opt/hcat/hadoop_conf';
WARNING 2693: Configuration parameter JavaClassPathSuffixForUDx has been deprecated; setting it has no effect
=> \i /opt/vertica/packages/hcat/ddl/install.sql
CREATE LIBRARY
CREATE SOURCE FUNCTION
GRANT PRIVILEGE
CREATE PARSER FUNCTION
GRANT PRIVILEGE

Despite the warning message, this step is necessary. After taking these steps, HCatalog queries will work.

Defining a Schema Using the HCatalog Connector

After you set up the HCatalog Connector, you can use it to define a schema in your HP Vertica database to access the tables in a Hive database. You define the schema using the CREATE HCATALOG SCHEMA statement. See CREATE HCATALOG SCHEMA in the SQL Reference Manual for a full description.

When creating the schema, you must supply at least two pieces of information:

- the name of the schema to define in HP Vertica
- the host name or IP address of Hive's metastore database (the database server that contains metadata about Hive's data, such as the schema and table definitions)

Other parameters are optional. If you do not supply a value, HP Vertica uses default values.

After you define the schema, you can query the data in the Hive data warehouse in the same way you query a native HP Vertica table. The following example demonstrates creating an HCatalog schema and then querying several system tables to examine the contents of the new schema. See Viewing Hive Schema and Table Metadata for more information about these tables.

=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' HCATALOG_SCHEMA='default'
->    HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> -- Show list of all HCatalog schemas
=> \x
Expanded display is on.
=> SELECT * FROM v_catalog.hcatalog_schemata;
-[ RECORD 1 ]--------+------------------------------
schema_id            | 45035996273748980
schema_name          | hcat
schema_owner_id      | 45035996273704962
schema_owner         | dbadmin
create_time          | 2013-11-04 15:09:03.504094-05
hostname             | hcathost
port                 | 9933
webservice_hostname  | hcathost
webservice_port      | 50111
hcatalog_schema_name | default
hcatalog_user_name   | hcatuser
metastore_db_name    | hivemetastoredb

=> -- List the tables in all HCatalog schemas
=> SELECT * FROM v_catalog.hcatalog_table_list;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | messages
hcatalog_user_name | hcatuser
-[ RECORD 2 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | weblog
hcatalog_user_name | hcatuser
-[ RECORD 3 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | tweets
hcatalog_user_name | hcatuser

Querying Hive Tables Using HCatalog Connector

Once you have defined the HCatalog schema, you can query data from the Hive database by using the schema name in your query.

=> SELECT * from hcat.messages limit 10;
 messageid |   userid   |        time         |              message
-----------+------------+---------------------+------------------------------------
         1 | npfq1ayhi  | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
         2 | N7svORIoZ  | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
         3 | 4VvzN3d    | 2013-10-29 00:32:11 | porta Vivamus condimentum
         4 | heojkmtmc  | 2013-10-29 00:42:55 | lectus quis imperdiet
         5 | corows3of  | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
         6 | odrp1i     | 2013-10-29 01:04:23 | risus facilisis sollicitudin sceler
         7 | AU7a9Kp    | 2013-10-29 01:15:07 | turpis vehicula tortor
         8 | ZJWg185DkZ | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
         9 | E7ipAsYC3  | 2013-10-29 01:36:35 | varius Cum iaculis metus
        10 | kstcv      | 2013-10-29 01:47:19 | aliquam libero nascetur Cum mal
(10 rows)

Since the tables you access through the HCatalog Connector act like HP Vertica tables, you can perform operations that use both Hive data and native HP Vertica data, such as a join:

=> SELECT u.firstname, u.lastname, d.time, d.message from UserData u
->    JOIN hcat.messages d ON u.userid = d.userid LIMIT 10;
 FirstName | LastName |        time         |               Message
-----------+----------+---------------------+-------------------------------------
 Whitney   | Kerr     | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
 Troy      | Oneal    | 2013-10-29 00:32:11 | porta Vivamus condimentum
 Renee     | Coleman  | 2013-10-29 00:42:55 | lectus quis imperdiet
 Fay       | Moss     | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
 Dominique | Cabrera  | 2013-10-29 01:15:07 | turpis vehicula tortor
 Mohammad  | Eaton    | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
 Cade      | Barr     | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
 Oprah     | Mcmillan | 2013-10-29 01:36:35 | varius Cum iaculis metus
 Astra     | Sherman  | 2013-10-29 01:58:03 | dignissim odio Pellentesque primis
 Chelsea   | Malone   | 2013-10-29 02:08:47 | pede tempor dignissim Sed luctus
(10 rows)
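If you plan to run extensive analysis on this data, one option noted earlier is to load it into a native HP Vertica table through the connector. A minimal sketch, using the hcat.messages table from the examples above and a hypothetical target table name:

=> CREATE TABLE messages_local AS SELECT * FROM hcat.messages;

Because the copy is a point-in-time snapshot, reload it when the data in Hive changes.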

Viewing Hive Schema and Table Metadata

When using Hive, you access metadata about schemata and tables by executing statements written in HiveQL (Hive's version of SQL) such as SHOW TABLES. When using the HCatalog Connector, you can get metadata about the tables in the Hive database through several HP Vertica system tables.

There are four system tables that contain metadata about the tables accessible through the HCatalog Connector:

- HCATALOG_SCHEMATA lists all of the schemata (plural of schema) that have been defined using the HCatalog Connector. See HCATALOG_SCHEMATA in the SQL Reference Manual for detailed information.
- HCATALOG_TABLE_LIST contains an overview of all of the tables available from all schemata defined using the HCatalog Connector. This table only shows the tables which the user querying the table can access. The information in this table is retrieved using a single call to WebHCat for each schema defined using the HCatalog Connector, which means there is a little overhead when querying this table. See HCATALOG_TABLE_LIST in the SQL Reference Manual for detailed information.
- HCATALOG_TABLES contains more in-depth information than HCATALOG_TABLE_LIST. However, querying this table results in HP Vertica making a REST web service call to WebHCat for each table available through the HCatalog Connector. If there are many tables in the HCatalog schemata, this query could take a while to complete. See HCATALOG_TABLES in the SQL Reference Manual for more information.
- HCATALOG_COLUMNS lists metadata about all of the columns in all of the tables available through the HCatalog Connector. Similarly to HCATALOG_TABLES, querying this table results in one call to WebHCat per table, and therefore can take a while to complete. See HCATALOG_COLUMNS in the SQL Reference Manual for more information.

The following example demonstrates querying the system tables containing metadata for the tables available through the HCatalog Connector.

=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost'
->    HCATALOG_SCHEMA='default' HCATALOG_DB='default' HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> SELECT * FROM HCATALOG_SCHEMATA;
-[ RECORD 1 ]--------+-----------------------------
schema_id            | 45035996273864536
schema_name          | hcat
schema_owner_id      | 45035996273704962
schema_owner         | dbadmin
create_time          | 2013-11-05 10:19:54.70965-05
hostname             | hcathost
port                 | 9083
webservice_hostname  | hcathost
webservice_port      | 50111
hcatalog_schema_name | default
hcatalog_user_name   | hcatuser
metastore_db_name    | hivemetastoredb

=> SELECT * FROM HCATALOG_TABLE_LIST;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | hcatalogtypes
hcatalog_user_name | hcatuser
-[ RECORD 2 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | tweets
hcatalog_user_name | hcatuser
-[ RECORD 3 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | messages
hcatalog_user_name | hcatuser
-[ RECORD 4 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | msgjson
hcatalog_user_name | hcatuser

=> -- Get detailed description of a specific table
=> SELECT * FROM HCATALOG_TABLES WHERE table_name = 'msgjson';
-[ RECORD 1 ]---------+-----------------------------------------------------------
table_schema_id       | 45035996273864536
table_schema          | hcat
hcatalog_schema       | default
table_name            | msgjson
hcatalog_user_name    | hcatuser
min_file_size_bytes   | 13524
total_number_files    | 10
location              | hdfs://hive.example.com:8020/user/exampleuser/msgjson
last_update_time      | 2013-11-05 14:18:07.625-05
output_format         | org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat
last_access_time      | 2013-11-11 13:21:33.741-05
max_file_size_bytes   | 45762
is_partitioned        | f
partition_expression  |
table_owner           | hcatuser
input_format          | org.apache.hadoop.mapred.textinputformat
total_file_size_bytes | 453534
hcatalog_group        | supergroup
permission            | rwxr-xr-x

=> -- Get list of columns in a specific table
=> SELECT * FROM HCATALOG_COLUMNS WHERE table_name = 'hcatalogtypes'
->    ORDER BY ordinal_position;
-[ RECORD 1 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | intcol
hcatalog_data_type       | int
data_type                | int
data_type_id             | 6
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 1
-[ RECORD 2 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | floatcol
hcatalog_data_type       | float
data_type                | float
data_type_id             | 7
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 2
-[ RECORD 3 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | doublecol
hcatalog_data_type       | double
data_type                | float
data_type_id             | 7
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 3
-[ RECORD 4 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | charcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 4
-[ RECORD 5 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | varcharcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 5
-[ RECORD 6 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | boolcol
hcatalog_data_type       | boolean
data_type                | boolean
data_type_id             | 5
data_type_length         | 1
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 6
-[ RECORD 7 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | timestampcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 7
-[ RECORD 8 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | varbincol
hcatalog_data_type       | binary
data_type                | varbinary(65000)
data_type_id             | 17
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 8
-[ RECORD 9 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | bincol
hcatalog_data_type       | binary
data_type                | varbinary(65000)
data_type_id             | 17
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 9
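Because these are ordinary system tables, you can also apply normal SQL to them instead of reading whole records. For example, the following sketch lists each table's size and partitioning status, using columns shown in the output above (keep in mind that HCATALOG_TABLES makes one WebHCat call per table):

=> SELECT table_name, is_partitioned, total_file_size_bytes
->    FROM HCATALOG_TABLES
->    ORDER BY total_file_size_bytes DESC;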

Using the HCatalog Connector data_type varbinary(65000) data_type_id 17 data_type_length 65000 character_maximum_length 65000 numeric_precision numeric_scale datetime_precision interval_precision ordinal_position 8 -[ RECORD 9 ]------------+----------------- table_schema hcat hcatalog_schema default table_name hcatalogtypes is_partition_column f column_name bincol hcatalog_data_type binary data_type varbinary(65000) data_type_id 17 data_type_length 65000 character_maximum_length 65000 numeric_precision numeric_scale datetime_precision interval_precision ordinal_position 9 Synching an HCatalog Schema With a Local Schema Querying data from an HCatalog schema can be slow due to Hive and WebHCat performance issues. This slow performance can be especially annoying when you use the HCatalog Connector to query the HCatalog schema's metadata to examine the structure of the tables in the Hive database. To avoid this problem you can use the SYNC_WITH_HCATALOG_SCHEMA function to create a snapshot of the HCatalog schema's metadata within an HP Vertica schema. You supply this function with the name of a pre-existing HP Vertica schema and an HCatalog schema available through the HCatalog Connector. It creates a set of external tables within the HP Vertica schema that you can then use to examine the structure of the tables in the Hive database. Because the metadata in the HP Vertica schema is local, query planning is much faster. You can also use standard HP Vertica statements and system tables queries to examine the structure of Hive tables in the HCatalog schema. Caution: The SYNC_WITH_HCATALOG_SCHEMA function overwrites tables in the HP Vertica schema whose names match a table in the HCatalog schema. To avoid losing data, always create an empty HP Vertica schema to sync with an HCatalog schema. The HP Vertica schema is just a snapshot of the HCatalog schema's metadata. HP Vertica does not synchronize later changes to the HCatalog schema with the local schema after you call SYNC_ HP Vertica Analytic Database (7.1.x) Page 78 of 123

Using the HCatalog Connector WITH_HCATALOG_SCHEMA. You can call the function again to re-synchronize the local schema to the HCatalog schema. Note: By default, the function does not drop tables that appear in the local schema that do not appear in the HCatalog schema. Thus after the function call the local schema does not reflect tables that have been dropped in the Hive database. You can change this behavior by supplying the optional third Boolean argument that tells the function to drop any table in the local schema that does not correspond to a table in the HCatalog schema. The following example demonstrates calling SYNC_WITH_HCATALOG_SCHEMA to sync the HCatalog schema named hcat with a local schema. => CREATE SCHEMA hcat_local; CREATE SCHEMA => SELECT sync_with_hcatalog_schema('hcat_local', 'hcat'); sync_with_hcatalog_schema ---------------------------------------- Schema hcat_local synchronized with hcat tables in hcat = 56 tables altered in hcat_local = 0 tables created in hcat_local = 56 stale tables in hcat_local = 0 table changes erred in hcat_local = 0 (1 row) => -- Use vsql's \d command to describe a table in the synced schema => \d hcat_local.messages List of Fields by Tables Schema Table Column Type Size Default Not Null Primary Key Foreign Key -----------+----------+---------+----------------+-------+---------+----------+---------- ---+------------- hcat_local messages id int 8 f f hcat_local messages userid varchar(65000) 65000 f f hcat_local messages "time" varchar(65000) 65000 f f hcat_local messages message varchar(65000) 65000 f f (4 rows) Note: You can query tables in the local schema you synched with an HCatalog schema. Querying tables in a synched schema isn't much faster than directly querying the HCatalog schema because SYNC_WITH_HCATALOG_SCHEMA only duplicates the HCatalog schema's metadata. The data in the table is still retrieved using the HCatalog Connector, Data Type Conversions from Hive to HP Vertica The data types recognized by Hive differ from the data types recognize by HP Vertica. The following table lists how the HCatalog Connector converts Hive data types into data types HP Vertica Analytic Database (7.1.x) Page 79 of 123

Using the HCatalog Connector compatible with HP Vertica. Hive Data Type TINYINT (1-byte) SMALLINT (2-bytes) INT (4-bytes) BIGINT (8-bytes) BOOLEAN FLOAT (4-bytes) DOUBLE (8-bytes) Vertica Data Type TINYINT (8-bytes) SMALLINT (8-bytes) INT (8-bytes) BIGINT (8-bytes) BOOLEAN FLOAT (8-bytes) DOUBLE PRECISION (8-bytes) STRING (2 GB max) VARCHAR (65000) BINARY (2 GB max) VARBINARY (65000) LIST/ARRAY MAP STRUCT VARCHAR (65000) containing a JSON-format representation of the list. VARCHAR (65000) containing a JSON-format representation of the map. VARCHAR (65000) containing a JSON-format representation of the struct. Data-Width Handling Differences Between Hive and HP Vertica The HCatalog Connector relies on Hive SerDe classes to extract data from files on HDFS. Therefore, the data read from these files are subject to Hive's data width restrictions. For example, suppose the SerDe parses a value for an INT column into a value that is greater than 2 32-1 (the maximum value for a 32-bit integer). In this case, the value is rejected even if it would fit into an HP Vertica's 64-bit INTEGER column because it cannot fit into Hive's 32-bit INT. Once the value has been parsed and converted to an HP Vertica data type, it is treated as an native data. This treatment can result in some confusion when comparing the results of an identical query run in Hive and in HP Vertica. For example, if your query adds two INT values that result in a value that is larger than 2 32-1, the value overflows its 32-bit INT data type, causing Hive to return an error. When running the same query with the same data in HP Vertica using the HCatalog Connector, the value will probably still fit within HP Vertica's 64-int value. Thus the addition is successful and returns a value. Using Non-Standard SerDes Hive stores its data in unstructured flat files located in the Hadoop Distributed File System (HDFS). When you execute a Hive query, it uses a set of serializer and deserializer (SerDe) classes to extract data from these flat files and organize it into a relational database table. For Hive to be able HP Vertica Analytic Database (7.1.x) Page 80 of 123

Using the HCatalog Connector to extract data from a file, it must have a SerDe that can parse the data the file contains. When you create a table in Hive, you can select the SerDe to be used for the table's data. Hive has a set of standard SerDes that handle data in several formats such as delimited data and data extracted using regular expressions. You can also use third-party or custom-defined SerDes that allow Hive to process data stored in other file formats. For example, some commonly-used third-party SerDes handle data stored in JSON format. The HCatalog Connector directly fetches file segments from HDFS and uses Hive's SerDes classes to extract data from them. The Connector includes all Hive's standard SerDes classes, so it can process data stored in any file that Hive natively supports. If you want to query data from a Hive table that uses a custom SerDe, you must first install the SerDe classes on the HP Vertica cluster. Determining Which SerDe You Need If you have access to the Hive command line, you can determine which SerDe a table uses by using Hive's SHOW CREATE TABLE statement. This statement shows the HiveQL statement needed to recreate the table. For example: hive> SHOW CREATE TABLE msgjson; OK CREATE EXTERNAL TABLE msgjson( messageid int COMMENT 'from deserializer', userid string COMMENT 'from deserializer', time string COMMENT 'from deserializer', message string COMMENT 'from deserializer') ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.jsonserde' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.textinputformat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' LOCATION 'hdfs://hivehost.example.com:8020/user/exampleuser/msgjson' TBLPROPERTIES ( 'transient_lastddltime'='1384194521') Time taken: 0.167 seconds In the example, ROW FORMAT SERDE indicates that a special SerDe is used to parse the data files. The next row shows that the class for the SerDe is named org.apache.hadoop.hive.contrib.serde2.jsonserde.you must provide the HCatalog Connector with a copy of this SerDe class so that it can read the data from this table. You can also find out which SerDe class you need by querying the table that uses the custom SerDe. The query will fail with an error message that contains the class name of the SerDe needed to parse the data in the table. In the following example, the portion of the error message that names the missing SerDe class is in bold. => SELECT * FROM hcat.jsontable; ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0 HP Vertica Analytic Database (7.1.x) Page 81 of 123

Using the HCatalog Connector com.vertica.sdk.udfexception: Error message is [ org.apache.hcatalog.common.hcatexception : 2004 : HCatOutputFormat not initialized, setoutput has to be called. Cause : java.io.ioexception: java.lang.runtimeexception: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException SerDe com.cloudera.hive.serde.jsonserde does not exist) ] HINT If error message is not descriptive or local, may be we cannot read metadata from hive metastore service thrift://hcathost:9083 or HDFS namenode (check UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information) at com.vertica.hcatalogudl.hcatalogsplitsnoopsourcefactory.plan(hcatalogsplitsnoopsourcefactory.java:98) at com.vertica.udxfence.udxexeccontext.planudsource(udxexeccontext.java:898)... Installing the SerDe on the HP Vertica Cluster You usually have two options to getting the SerDe class file the HCatalog Connector needs: Find the installation files for the SerDe, then copy those over to your HP Vertica cluster. For example, there are several third-party JSON SerDes available from sites like Google Code and GitHub. You may find the one that matches the file installed on your Hive cluster. If so, then download the package and copy it to your HP Vertica cluster. Directly copy the JAR files from a Hive server onto your HP Vertica cluster. The location for the SerDe JAR files depends on your Hive installation. On some systems, they may be located in /usr/lib/hive/lib. Wherever you get the files, copy them into the /opt/vertica/packages/hcat/lib directory on every node in your HP Vertica cluster. Important: If you add a new host to your HP Vertica cluster, remember to copy every custom SerDer JAR file to it. Troubleshooting HCatalog Connector Problems You may encounter the following issues when using the HCatalog Connector. Connection Errors When you use CREATE HCATALOG SCHEMA to create a new schema, the HCatalog Connector does not immediately attempt to connect to the WebHCat or metastore servers. Instead, when you execute a query using the schema or HCatalog-related system tables, the connector attempts to connect to and retrieve data from your Hadoop cluster. The types of errors you get depend on which parameters are incorrect. Suppose you have incorrect parameters for the metastore database, but correct parameters for WebHCat. In this case, HCatalog-related system table queries succeed, while queries on the HCatalog schema fail. The following example demonstrates creating an HCatalog schema with the correct default WebHCat information. However, the port number for the metastore database is incorrect. HP Vertica Analytic Database (7.1.x) Page 82 of 123

Using the HCatalog Connector => CREATE HCATALOG SCHEMA hcat2 WITH hostname='hcathost' -> HCATALOG_SCHEMA='default' HCATALOG_USER='hive' PORT=1234; CREATE SCHEMA => SELECT * FROM HCATALOG_TABLE_LIST; -[ RECORD 1 ]------+--------------------- table_schema_id 45035996273864536 table_schema hcat2 hcatalog_schema default table_name test hcatalog_user_name hive => SELECT * FROM hcat2.test; ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0 com.vertica.sdk.udfexception: Error message is [ org.apache.hcatalog.common.hcatexception : 2004 : HCatOutputFormat not initialized, setoutput has to be called. Cause : java.io.ioexception: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.ttransportexception: java.net.connectexception: Connection refused at org.apache.thrift.transport.tsocket.open(tsocket.java:185) at org.apache.hadoop.hive.metastore.hivemetastoreclient.open( HiveMetaStoreClient.java:277)... To resolve these issues, you must drop the schema and recreate it with the correct parameters. If you still have issues, determine whether there are connectivity issues between your HP Vertica cluster and your Hadoop cluster. Such issues can include a firewall that prevents one or more HP Vertica hosts from contacting the WebHCat, metastore, or HDFS hosts. You may also see this error if you are using HA NameNode, particularly with larger tables that HDFS splits into multiple blocks. See Using the HCatalog Connector with HA NameNode for more information about correcting this problem. UDx Failure When Querying Data: Error 3399 You might see an error when querying data (as opposed to metadata like schema information). This can happen for the following reasons: You are not using the same version of Java on your Hadoop and HP Vertica nodes. In this case you need to change one of them to match the other. You have not used hcatutil to copy the Hadoop and Hive libraries to HP Vertica. You copied the libraries but they no longer match the versions of Hive and Hadoop that you are using. The version of Hadoop you are using relies on a third-party library that you must copy manually. If you did not copy the libraries, follow the instructions in Configuring HP Vertica for HCatalog. HP Vertica Analytic Database (7.1.x) Page 83 of 123

Using the HCatalog Connector If the Hive jars that you copied from Hadoop are out of date, you might see the an error message like the following: ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0 Error message is [ Found interface org.apache.hadoop.mapreduce.jobcontext, but class was expected ] HINT hive metastore service is thrift://localhost:13433 (check UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information) This usually signals a problem with hive-hcatalog-core jar. Make sure you have an up-to-date copy of this. Remember that if you rerun hcatutil you also need to re-create the HCatalog schema. You might also see a different form of this error: ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0 Error message is [ javax/servlet/filter ] This error can be reported even if hcatutil reports that your libraries are up to date. The javax.servlet.filter class is in a library that some versions of Hadoop use but that is not usually part of the Hadoop installation directly. If you see an error mentioning this class, locate servlet-api-*.jar on a Hadoop node and copy it to the hcat/lib directory on all database nodes. If you cannot locate it on a Hadoop node, locate and download it from the Internet. (This case is rare.) The library version must be 2.3 or higher. Once you have copied the jar to the hcat/lib directory, reinstall the HCatalog connector as explained in Configuring HP Vertica for HCatalog. SerDe Errors Errors can occur if you attempt to query a Hive table that uses a non-standard SerDe. If you have not installed the SerDe JAR files on your HP Vertica cluster, you receive an error similar to the one in the following example: => SELECT * FROM hcat.jsontable; ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0 com.vertica.sdk.udfexception: Error message is [ org.apache.hcatalog.common.hcatexception : 2004 : HCatOutputFormat not initialized, setoutput has to be called. Cause : java.io.ioexception: java.lang.runtimeexception: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException SerDe com.cloudera.hive.serde.jsonserde does not exist) ] HINT If error message is not descriptive or local, may be we cannot read metadata from hive metastore service thrift://hcathost:9083 or HDFS namenode (check UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information) at com.vertica.hcatalogudl.hcatalogsplitsnoopsourcefactory.plan(hcatalogsplitsnoopsourcefactory.java:98) at com.vertica.udxfence.udxexeccontext.planudsource(udxexeccontext.java:898) HP Vertica Analytic Database (7.1.x) Page 84 of 123

...

In the error message, you can see that the root cause is a missing SerDe class (the class named in the MetaException, com.cloudera.hive.serde.JSONSerDe in this example). To resolve this issue, install the SerDe class on your HP Vertica cluster. See Using Non-Standard SerDes for more information.

This error may occur intermittently if just one or a few hosts in your cluster do not have the SerDe class.

Differing Results Between Hive and HP Vertica Queries

Sometimes, running the same query on Hive and on HP Vertica through the HCatalog Connector can return different results. This discrepancy is often caused by the differences between the data types supported by Hive and HP Vertica. See Data Type Conversions from Hive to HP Vertica for more information about supported data types.

Preventing Excessive Query Delays

Network issues or high system loads on the WebHCat server can cause long delays while querying a Hive database using the HCatalog Connector. While HP Vertica cannot resolve these issues, you can set parameters that limit how long HP Vertica waits before canceling a query on an HCatalog schema. You can set these parameters globally using HP Vertica configuration parameters. You can also set them for specific HCatalog schemas in the CREATE HCATALOG SCHEMA statement. These schema-specific settings override the settings in the configuration parameters.

The HCatConnectionTimeout configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_CONNECTION_TIMEOUT parameter control how many seconds the HCatalog Connector waits for a connection to the WebHCat server. A value of 0 (the default setting for the configuration parameter) means to wait indefinitely. If the WebHCat server does not respond by the time this timeout elapses, the HCatalog Connector breaks the connection and cancels the query. If you find that some queries on an HCatalog schema pause excessively, try setting this parameter to a timeout value, so the query does not hang indefinitely.

The HCatSlowTransferTime configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_TIME parameter specify how long the HCatalog Connector waits for data after making a successful connection to the WebHCat server. After the specified time has elapsed, the HCatalog Connector determines whether the data transfer rate from the WebHCat server is at least the value set in the HCatSlowTransferLimit configuration parameter (or by the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_LIMIT parameter). If it is not, then the HCatalog Connector terminates the connection and cancels the query.

You can set these parameters to cancel queries that run very slowly but do eventually complete. However, query delays are usually caused by a slow connection rather than a problem establishing the connection. Therefore, try adjusting the slow transfer rate settings first. If you find the cause of the issue is connections that never complete, you can alternately adjust the Linux TCP socket timeouts to a suitable value instead of relying solely on the HCatConnectionTimeout parameter.
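The following is a minimal sketch of setting these limits, first globally and then for a single schema. The database name mydb, the host and port, and the specific values (a 30-second connection timeout, and cancellation when less than about 100 KB per second arrives after 60 seconds) are examples only; the syntax mirrors the configuration-parameter and CREATE HCATALOG SCHEMA examples used elsewhere in this guide.

=> ALTER DATABASE mydb SET HCatConnectionTimeout = 30;     -- example value, in seconds
=> ALTER DATABASE mydb SET HCatSlowTransferTime = 60;      -- example value, in seconds
=> ALTER DATABASE mydb SET HCatSlowTransferLimit = 102400; -- example minimum transfer rate

To override the global settings for one schema, supply the corresponding parameters when you create it:

=> CREATE HCATALOG SCHEMA hcat_tuned WITH HOSTNAME='hcathost' PORT=50111
->   HCATALOG_SCHEMA='default' HCATALOG_USER='hive'
->   HCATALOG_CONNECTION_TIMEOUT=30
->   HCATALOG_SLOW_TRANSFER_TIME=60
->   HCATALOG_SLOW_TRANSFER_LIMIT=102400;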

Using the HP Vertica Storage Location for HDFS

The HP Vertica Storage Location for HDFS lets HP Vertica store its data in a Hadoop Distributed File System (HDFS) similarly to how it stores data on a native Linux filesystem. It lets you create a storage tier for lower-priority data to free space on your HP Vertica cluster for higher-priority data.

For example, suppose you store website clickstream data in your HP Vertica database. You may find that most queries only examine the last six months of this data. However, there are a few low-priority queries that still examine data older than six months. In this case, you could choose to move the older data to an HDFS storage location so that it is still available for the infrequent queries. The queries on the older data are slower because they now access data stored on HDFS rather than native disks. However, you free space on your HP Vertica cluster's storage for higher-priority, frequently-queried data.

Storage Location for HDFS Requirements

To store HP Vertica's data on HDFS, verify that:

Your Hadoop cluster has WebHDFS enabled.

All of the nodes in your HP Vertica cluster can connect to all of the nodes in your Hadoop cluster. Any firewall between the two clusters must allow connections on the ports used by HDFS. See Testing Your Hadoop WebHDFS Configuration for a procedure to test the connectivity between your HP Vertica and Hadoop clusters.

You have a Hadoop user whose username matches the name of the HP Vertica database administrator (usually named dbadmin). This Hadoop user must have read and write access to the HDFS directory where you want HP Vertica to store its data.

Your HDFS has enough storage available for HP Vertica data. See HDFS Space Requirements below for details.

The data you store in an HDFS-backed storage location does not expand your database's size beyond any data allowance in your HP Vertica license. HP Vertica counts data stored in an HDFS-backed storage location as part of any data allowance set by your license. See Managing Licenses in the Administrator's Guide for more information.

HDFS Space Requirements

If your HP Vertica database is K-safe, HDFS-based storage locations contain two copies of the data you store in them. One copy is the primary projection, and the other is the buddy projection. If you have enabled HDFS's data redundancy feature, Hadoop stores both projections multiple times. This duplication may seem excessive. However, it is similar to how a RAID level 1 or higher redundantly stores copies of both HP Vertica's primary and buddy projections. The redundant

Using the HP Vertica Storage Location for HDFS copies also help the performance of HDFS by enabling multiple nodes to process a request for a file. Verify that your HDFS installation has sufficient space available for redundant storage of both the primary and buddy projections of your K-safe data. You can adjust the number of duplicates stored by HDFS by setting the HadoopFSReplication configuration parameter. See Troubleshooting HDFS Storage Locations for details. Additional Requirements for Backing Up Data Stored on HDFS In Enterprise Edition, to back up your data stored in HDFS storage locations, your Hadoop cluster must: Have HDFS 2.0 or later installed. The vbr.py backup utility uses the snapshot feature introduced in HDFS 2.0. Have snapshotting enabled for the directories to be used for backups. The easiest way to do this is to give the database administrator's account superuser privileges in Hadoop, so that snapshotting can be set automatically. Alternatively, use Hadoop to enable snapshotting for each directory before using it for backups. In addition, your HP Vertica database must: Have enough Hadoop components and libraries installed in order to run the Hadoop distcp command as the HP Vertica database-administrator user (usually dbadmin). Have the JavaBinaryForUDx and HadoopHome configuration parameters set correctly. Caution: After you have created an HDFS storage location, full database backups will fail with the error message: ERROR 5127: Unable to create snapshot No such file /usr/bin/hadoop: check the HadoopHome configuration parameter This error is caused by the backup script not being able to back up the HDFS storage locations. You must configure HP Vertica and Hadoop to enable the backup script to back these locations. After you configure HP Vertica and Hadoop, you can once again perform full database backups. See Backing Up HDFS Storage Locations for details on configuring your HP Vertica and Hadoop clusters to enable HDFS storage location backup. How the HDFS Storage Location Stores Data The HP Vertica Storage Location for HDFS stores data on the Hadoop HDFS similarly to the way HP Vertica stores data in the Linux file system. See Managing Storage Locations in the Administrator's Guide for more information about storage locations. When you create a storage location on HDFS, HP Vertica stores the ROS containers holding its data on HDFS. You can HP Vertica Analytic Database (7.1.x) Page 87 of 123

Using the HP Vertica Storage Location for HDFS choose which data uses the HDFS storage location: from the data for just a single table to all of the database's data. When HP Vertica reads data from or writes data to an HDFS storage location, the node storing or retrieving the data contacts the Hadoop cluster directly to transfer the data. If a single ROS container file is split among several Hadoop nodes, the HP Vertica node connects to each of them. The HP Vertica node retrieves the pieces and reassembles the file. By having each node fetch its own data directly from the source, data transfers are parallel, increasing their efficiency. Having the HP Vertica nodes directly retrieve the file splits also reduces the impact on the Hadoop cluster. What You Can Store on HDFS Use HDFS storage locations to store only data. You cannot store catalog information in an HDFS storage location. Caution: While it is possible to use an HDFS storage location for temporary data storage, you must never do so. Using HDFS for temporary storage causes severe performance issues. The only time you change an HDFS storage location's usage to temporary is when you are in the process of removing it. What HDFS Storage Locations Cannot Do Because HP Vertica uses the storage locations to store ROS containers in a proprietary format, MapReduce and other Hadoop components cannot access your HP Vertica data stored in HDFS. Never allow another program that has access to HDFS to write to the ROS files. Any outside modification of these files can lead to data corruption and loss. Use the HP Vertica Connector for Hadoop MapReduce if you need your MapReduce job to access HP Vertica data. Other applications must use the HP Vertica client libraries to access HP Vertica data. The storage location stores and reads only ROS containers. It cannot read data stored in native formats in HDFS. If you want HP Vertica to read data from HDFS, use the HP Vertica Connector for HDFS. If the data you want to access is available in a Hive database, you can use the HP Vertica Connector for HCatalog. Creating an HDFS Storage Location Before creating an HDFS storage location, you must first create a Hadoop user who can access the data: If your HDFS cluster is unsecured, create a Hadoop user whose username matches the user name of the HP Vertica database administrator account. For example, suppose your database administrator account has the default username dbadmin. You must create a Hadoop user account named dbadmin and give it full read and write access to the directory on HDFS to store files. HP Vertica Analytic Database (7.1.x) Page 88 of 123

Using the HP Vertica Storage Location for HDFS If your HDFS cluster uses Kerberos authentication, create a Kerberos principal for HP Vertica and give it read and write access to the HDFS directory that will be used for the storage location. See Configuring Kerberos. Consult the documentation for your Hadoop distribution to learn how to create a user and grant the user read and write permissions for a directory in HDFS. Use the CREATE LOCATION statement to create an HDFS storage location. To do so, you must: Supply the WebHDFS URI for HDFS directory where you want HP Vertica to store the location's data as the path argument,. This URI is the same as a standard HDFS URL, except it uses the webhdfs:// protocol and its path does not start with /webhdfs/v1/. Include the ALL NODES SHARED keywords, as all HDFS storage locations are shared storage. This is required even if you have only one HDFS node in your cluster. The following example demonstrates creating an HDFS storage location that: Is located on the Hadoop cluster whose name node's host name is hadoop. Stores its files in the /user/dbadmin directory. Is labeled coldstorage. The example also demonstrates querying the STORAGE_LOCATIONS system table to verify that the storage location was created. => CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' ALL NODES SHARED USAGE 'data' LABEL 'coldstorage'; CREATE LOCATION => SELECT node_name,location_path,location_label FROM STORAGE_LOCATIONS; node_name location_path location_label ------------------+------------------------------------------------------+--------------- - v_vmart_node0001 /home/dbadmin/vmart/v_vmart_node0001_data v_vmart_node0001 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 coldstorage v_vmart_node0002 /home/dbadmin/vmart/v_vmart_node0002_data v_vmart_node0002 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 coldstorage v_vmart_node0003 /home/dbadmin/vmart/v_vmart_node0003_data v_vmart_node0003 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 coldstorage (6 rows) Each node in the cluster has created its own directory under the dbadmin directory in HDFS. These individual directories prevent the nodes from interfering with each other's files in the shared location. HP Vertica Analytic Database (7.1.x) Page 89 of 123
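If you have not yet prepared the HDFS directory used in the preceding example, the following is a minimal sketch of creating it and granting the database administrator access on an unsecured cluster. It assumes the default dbadmin account and the /user/dbadmin path from the example; run the commands on a Hadoop node as an HDFS superuser (for example, the hdfs account).

$ # Run as an HDFS superuser; the account and path are the ones used in the example above
$ hdfs dfs -mkdir -p /user/dbadmin
$ hdfs dfs -chown dbadmin:dbadmin /user/dbadmin
$ hdfs dfs -chmod 750 /user/dbadmin

For a Kerberos-secured cluster, grant the same access to the HP Vertica principal instead, as described above.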

Creating a Storage Location Using HP Vertica for SQL on Hadoop

If you are using the Enterprise Edition product, then you typically use HDFS storage locations for lower-priority data as shown in the previous example. If you are using the HP Vertica for SQL on Hadoop product, however, all of your data must be stored in HDFS. To create an HDFS storage location that complies with the HP Vertica for SQL on Hadoop license, first create the location on all nodes and then set its storage policy to HDFS.

To create the location in HDFS on all nodes:

=> CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' ALL NODES SHARED
    USAGE 'data' LABEL 'HDFS';

Next, set the storage policy for your database to use this location:

=> SELECT set_object_storage_policy('dbname', 'HDFS');

This causes all data to be written to the HDFS storage location instead of the local disk.

Adding HDFS Storage Locations to New Nodes

Any nodes you add to your cluster do not have access to existing HDFS storage locations. You must manually create the storage location for the new node using the CREATE LOCATION statement. Do not use the ALL NODES keyword in this statement. Instead, use the NODE keyword with the name of the new node to tell HP Vertica that just that node needs to add the shared location.

Caution: You must manually create the storage location. Otherwise, the new node uses the default storage policy (usually, storage on the local Linux filesystem) to store data that the other nodes store in HDFS. As a result, the node can run out of disk space.

The following example shows how to add the storage location from the preceding example to a new node named v_vmart_node0004:

=> CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' NODE 'v_vmart_node0004'
    SHARED USAGE 'data' LABEL 'coldstorage';

Any active standby nodes in your cluster when you create an HDFS-based storage location automatically create their own instances of the location. When the standby node takes over for a down node, it uses its own instance of the location to store data for objects using the HDFS-based storage policy. Treat standby nodes added after you create the storage location as any other new node. You must manually define the HDFS storage location.
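After adding the location to a new node, you can confirm which nodes define it by querying the STORAGE_LOCATIONS system table, as in the earlier example. This is a sketch that assumes the coldstorage label used above.

=> SELECT node_name, location_path, location_label
->   FROM STORAGE_LOCATIONS
->   WHERE location_label = 'coldstorage'
->   ORDER BY node_name;

The new node (and any standby nodes you configured) should now appear in the result alongside the original nodes.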

Using the HP Vertica Storage Location for HDFS Creating a Storage Policy for HDFS Storage Locations After you create an HDFS storage location, you assign database objects to the location by setting storage policies. Based on these storage policies, database objects such as partition ranges, individual tables, whole schemata, or even the entire database store their data in the HDFS storage location. Use the SET_OBJECT_STORAGE_POLICY function to assign objects to an HDFS storage location. In the function call, supply the label you assigned to the HDFS storage location as the location label argument. You do so using the CREATE LOCATION statement's LABEL keyword. The following topics provide examples of storing data on HDFS. Storing an Entire Table in an HDFS Storage Location The following example demonstrates using SET_OBJECT_STORAGE_POLICY to store a table in an HDFS storage location. The example statement sets the policy for an existing table, named messages, to store its data in an HDFS storage location, named coldstorage. => SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage'); This table's data is moved to the HDFS storage location with the next merge-out. Alternatively, you can have HP Vertica move the data immediately by using the enforce_storage_move parameter. You can query the STORAGE_CONTAINERS system table and examine the location_label column to verify that HP Vertica has moved the data: => SELECT node_name, projection_name, location_label, total_row_count FROM V_ MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'messages%'; node_name projection_name location_label total_row_count ------------------+-----------------+----------------+----------------- v_vmart_node0001 messages_b0 coldstorage 366057 v_vmart_node0001 messages_b1 coldstorage 366511 v_vmart_node0002 messages_b0 coldstorage 367432 v_vmart_node0002 messages_b1 coldstorage 366057 v_vmart_node0003 messages_b0 coldstorage 366511 v_vmart_node0003 messages_b1 coldstorage 367432 (6 rows) See Creating Storage Policies in the Administrator's Guide for more information about assigning storage policies to objects. Storing Table Partitions in HDFS If the data you want to store in an HDFS-based storage location is in a partitioned table, you can choose to store some of the partitions in HDFS. This capability lets you to periodically move old data that is queried less frequently off of more costly higher-speed storage (such as on a solid- state HP Vertica Analytic Database (7.1.x) Page 91 of 123

Using the HP Vertica Storage Location for HDFS drive). You can instead use slower and less expensive HDFS storage. The older data is still accessible in queries, just at a slower speed. In this scenario, the faster storage is often referred to as "hot storage," and the slower storage is referred to as "cold storage." For example, suppose you have a table named messages containing social media messages that is partitioned by the year and month of the message's timestamp. You can list the partitions in the table by querying the PARTITIONS system table. => SELECT partition_key, projection_name, node_name, location_label FROM partitions ORDER BY partition_key; partition_key projection_name node_name location_label --------------+-----------------+------------------+---------------- 201309 messages_b1 v_vmart_node0001 201309 messages_b0 v_vmart_node0003 201309 messages_b1 v_vmart_node0002 201309 messages_b1 v_vmart_node0003 201309 messages_b0 v_vmart_node0001 201309 messages_b0 v_vmart_node0002 201310 messages_b0 v_vmart_node0002 201310 messages_b1 v_vmart_node0003 201310 messages_b0 v_vmart_node0001... 201405 messages_b0 v_vmart_node0002 201405 messages_b1 v_vmart_node0003 201405 messages_b1 v_vmart_node0001 201405 messages_b0 v_vmart_node0001 (54 rows) Next, suppose you find that most queries on this table access only the latest month or two of data. You may decide to move the older data to cold storage in an HDFS-based storage location. After you move the data, it is still available for queries, but with lower query performance. To move partitions to the HDFS storage location, supply the lowest and highest partition key values to be moved in the SET_OBJECT_STORAGE_POLICY function call. The following example shows how to move data between two dates to an HDFS-based storage location. In this example: Partition key value 201309 represents September 2013. Partition key value 201403 represents March 2014. The name, coldstorage, is the label of the HDFS-based storage location. => SELECT SET_OBJECT_STORAGE_POLICY('messages','coldstorage', '201309', '201403' USING PARAMETERS ENFORCE_STORAGE_MOVE = 'true'); After the statement finishes, the range of partitions now appear in the HDFS storage location labeled coldstorage. This location name now displays in the PARTITIONS system table's location_ label column. => SELECT partition_key, projection_name, node_name, location_label FROM partitions ORDER BY partition_key; partition_key projection_name node_name location_label HP Vertica Analytic Database (7.1.x) Page 92 of 123

Using the HP Vertica Storage Location for HDFS --------------+-----------------+------------------+---------------- 201309 messages_b0 v_vmart_node0003 coldstorage 201309 messages_b1 v_vmart_node0001 coldstorage 201309 messages_b1 v_vmart_node0002 coldstorage 201309 messages_b0 v_vmart_node0001 coldstorage... 201403 messages_b0 v_vmart_node0002 coldstorage 201404 messages_b0 v_vmart_node0001 201404 messages_b0 v_vmart_node0002 201404 messages_b1 v_vmart_node0001 201404 messages_b1 v_vmart_node0002 201404 messages_b0 v_vmart_node0003 201404 messages_b1 v_vmart_node0003 201405 messages_b0 v_vmart_node0001 201405 messages_b1 v_vmart_node0002 201405 messages_b0 v_vmart_node0002 201405 messages_b0 v_vmart_node0003 201405 messages_b1 v_vmart_node0001 201405 messages_b1 v_vmart_node0003 (54 rows) After your initial data move, you can move additional data to the HDFS storage location periodically. You move individual partitions or a range of partitions from the "hot" storage to the "cold" storage location using the same method: => SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage', '201404', '201404' USING PARAMETERS ENFORCE_STORAGE_MOVE = 'true'); SET_OBJECT_STORAGE_POLICY ---------------------------- Object storage policy set. (1 row) => SELECT projection_name, node_name, location_label FROM PARTITIONS WHERE PARTITION_KEY = '201404'; projection_name node_name location_label -----------------+------------------+---------------- messages_b0 v_vmart_node0002 coldstorage messages_b0 v_vmart_node0003 coldstorage messages_b1 v_vmart_node0003 coldstorage messages_b0 v_vmart_node0001 coldstorage messages_b1 v_vmart_node0002 coldstorage messages_b1 v_vmart_node0001 coldstorage (6 rows) Moving Partitions to a Table Stored on HDFS Another method of moving partitions from hot storage to cold storage is to move the partition's data to a separate table that is stored on HDFS. This method breaks the data into two tables, one containing hot data and the other containing cold data. Use this method if you want to prevent queries from inadvertently accessing data stored in the slower HDFS storage location. To query the older data, you must explicitly query the cold table. To move partitions: HP Vertica Analytic Database (7.1.x) Page 93 of 123

Using the HP Vertica Storage Location for HDFS 1. Create a new table whose schema matches that of the existing partitioned table. 2. Set the storage policy of the new table to use the HDFS-based storage location. 3. Use the MOVE_PARTITIONS_TO_TABLE function to move a range of partitions from the hot table to the cold table. The following example demonstrates these steps. You first create a table named cold_messages. You then assign it the HDFS-based storage location named coldstorage, and, finally, move a range of partitions. => CREATE TABLE cold_messages LIKE messages INCLUDING PROJECTIONS; => SELECT SET_OBJECT_STORAGE_POLICY('cold_messages', 'coldstorage'); => SELECT MOVE_PARTITIONS_TO_TABLE('messages','201309','201403','cold_messages'); Note: The partitions moved using this method do not immediately migrate to the storage location on HDFS. Instead, the Tuple Mover eventually moves them to the storage location. Backing Up HP Vertica Storage Locations for HDFS Note: The backup and restore features are available only in the Enterprise Edition product, not in HP Vertica for SQL on Hadoop. HP recommends that you regularly back up the data in your HP Vertica database. This recommendation includes data stored in your HDFS storage locations. The HP Vertica backup script (vbr.py) can back up HDFS storage locations. However, you must perform several configuration steps before it can back up these locations. Caution: After you have created an HDFS storage location, full database backups will fail with the error message: ERROR 5127: Unable to create snapshot No such file /usr/bin/hadoop: check the HadoopHome configuration parameter This error is caused by the backup script not being able to back up the HDFS storage locations. You must configure HP Vertica and Hadoop to enable the backup script to back these locations. After you configure HP Vertica and Hadoop, you can once again perform full database backups. There are several considerations for backing up HDFS storage locations in your database: The HDFS storage location backup feature relies on the snapshotting feature introduced in HDFS 2.0. You cannot back up an HDFS storage location stored on an earlier version of HDFS. HDFS storage locations do not support object-level backups. You must perform a full database backup in order to back up the data in your HDFS storage locations. HP Vertica Analytic Database (7.1.x) Page 94 of 123

Data in an HDFS storage location is backed up to HDFS. This backup guards against accidental deletion or corruption of data. It does not prevent data loss in the case of a catastrophic failure of the entire Hadoop cluster. To prevent data loss, you must have a backup and disaster recovery plan for your Hadoop cluster.

Data stored on the Linux native filesystem is still backed up to the location you specify in the backup configuration file. It and the data in HDFS storage locations are handled separately by the vbr.py backup script.

You must configure your HP Vertica cluster in order to restore database backups containing an HDFS storage location. See Configuring HP Vertica to Back Up HDFS Storage Locations for the configuration steps you must take.

The HDFS directory for the storage location must have snapshotting enabled. You can either directly configure this yourself or enable the database administrator's Hadoop account to do it for you automatically. See Configuring Hadoop to Enable Backup of HDFS Storage for more information.

The topics in this section explain the configuration steps you must take to enable the backup of HDFS storage locations.

Configuring HP Vertica to Restore HDFS Storage Locations

Your HP Vertica cluster must be able to run the Hadoop distcp command to restore a backup of an HDFS storage location. The easiest way to enable your cluster to run this command is to install several Hadoop packages on each node. These packages must be from the same distribution and version of Hadoop that is running on your Hadoop cluster.

The steps you need to take depend on:

The distribution and version of Hadoop running on the Hadoop cluster containing your HDFS storage location.

The distribution of Linux running on your HP Vertica cluster.

Note: Installing the Hadoop packages necessary to run distcp does not turn your HP Vertica database into a Hadoop cluster. This process installs just enough of the Hadoop support files on your cluster to run the distcp command. There is no additional overhead placed on the HP Vertica cluster, aside from a small amount of additional disk space consumed by the Hadoop support files.

Configuration Overview

The steps for configuring your HP Vertica cluster to restore backups for HDFS storage locations are:

1. If necessary, install and configure a Java runtime on the hosts in the HP Vertica cluster.

2. Find the location of your Hadoop distribution's package repository.

3. Add the Hadoop distribution's package repository to the Linux package manager on all hosts in your cluster.

4. Install the necessary Hadoop packages on your HP Vertica hosts.

5. Set two configuration parameters in your HP Vertica database related to Java and Hadoop.

6. If your HDFS storage location uses Kerberos, set additional configuration parameters to allow HP Vertica user credentials to be proxied.

7. Confirm that the Hadoop distcp command runs on your HP Vertica hosts.

The following sections describe these steps in greater detail.

Installing a Java Runtime

Your HP Vertica cluster must have a Java Virtual Machine (JVM) installed to run the Hadoop distcp command. It already has a JVM installed if you have configured it to:

Execute User-Defined Extensions developed in Java. See Developing User Defined Functions in Java for more information.

Access Hadoop data using the HCatalog Connector. See Using the HCatalog Connector for more information.

If your HP Vertica database already has a JVM installed, verify that your Hadoop distribution supports it. See your Hadoop distribution's documentation to determine which JVMs it supports. If the JVM installed on your HP Vertica cluster is not supported by your Hadoop distribution, you must uninstall it. Then you must install a JVM that is supported by both HP Vertica and your Hadoop distribution. See HP Vertica SDKs in Supported Platforms for a list of the JVMs compatible with HP Vertica.

If your HP Vertica cluster does not have a JVM (or its existing JVM is incompatible with your Hadoop distribution), follow the instructions in Installing the Java Runtime on Your HP Vertica Cluster.

Finding Your Hadoop Distribution's Package Repository

Many Hadoop distributions have their own installation system, such as Cloudera Manager or Hortonworks Ambari. However, they also support manual installation using native Linux packages such as RPM and .deb files. These package files are maintained in a repository. You can configure your HP Vertica hosts to access this repository to download and install Hadoop packages.

Using the HP Vertica Storage Location for HDFS Consult your Hadoop distribution's documentation to find the location of its Linux package repository. This information is often located in the portion of the documentation covering manual installation techniques. For example: The Hortonworks Version 2.1 topic on Configuring the Remote Repositories. The "Steps to Install CDH 5 Manually" section of the Cloudera Version 5.1.0 topic Installing CDH 5. Each Hadoop distribution maintains separate repositories for each of the major Linux package management systems. Find the specific repository for the Linux distribution running on your HP Vertica cluster. Be sure that the package repository that you select matches version of Hadoop distribution installed on your Hadoop cluster. Configuring HP Vertica Nodes to Access the Hadoop Distribution s Package Repository Configure the nodes in your HP Vertica cluster so they can access your Hadoop distribution's package repository. Your Hadoop distribution's documentation should explain how to add the repositories to your Linux platform. If the documentation does not explain how to add the repository to your packaging system, refer to your Linux distribution's documentation. The steps you need to take depend on the package management system your Linux platform uses. Usually, the process involves: Downloading a configuration file. Adding the configuration file to the package management system's configuration directory. For Debian-based Linux distributions, adding the Hadoop repository encryption key to the root account keyring. Updating the package management system's index to have it discover new packages. The following example demonstrates adding the Hortonworks 2.1 package repository to an Ubuntu 12.04 host. These steps in this example are explained in the Hortonworks documentation. $ wget http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/hdp.list \ -O /etc/apt/sources.list.d/hdp.list --2014-08-20 11:06:00-- http://public-repo- 1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list Connecting to 16.113.84.10:8080... connected. Proxy request sent, awaiting response... 200 OK Length: 161 [binary/octet-stream] Saving to: `/etc/apt/sources.list.d/hdp.list' 100%[======================================>] 161 --.-K/s in 0s 2014-08-20 11:06:00 (8.00 MB/s) - `/etc/apt/sources.list.d/hdp.list' saved [161/161] $ gpg --keyserver pgp.mit.edu --recv-keys B9733A7A07513CAD HP Vertica Analytic Database (7.1.x) Page 97 of 123

Using the HP Vertica Storage Location for HDFS gpg: requesting key 07513CAD from hkp server pgp.mit.edu gpg: /root/.gnupg/trustdb.gpg: trustdb created gpg: key 07513CAD: public key "Jenkins (HDP Builds) <jenkin@hortonworks.com>" imported gpg: Total number processed: 1 gpg: imported: 1 (RSA: 1) $ gpg -a --export 07513CAD apt-key add - OK $ apt-get update Hit http://us.archive.ubuntu.com precise Release.gpg Hit http://extras.ubuntu.com precise Release.gpg Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B] Hit http://us.archive.ubuntu.com precise-updates Release.gpg Get:2 http://public-repo-1.hortonworks.com HDP-UTILS Release.gpg [836 B] Get:3 http://public-repo-1.hortonworks.com HDP Release.gpg [836 B] Hit http://us.archive.ubuntu.com precise-backports Release.gpg Hit http://extras.ubuntu.com precise Release Get:4 http://security.ubuntu.com precise-security Release [50.7 kb] Get:5 http://public-repo-1.hortonworks.com HDP-UTILS Release [6,550 B] Hit http://us.archive.ubuntu.com precise Release Hit http://extras.ubuntu.com precise/main Sources Get:6 http://public-repo-1.hortonworks.com HDP Release [6,502 B] Hit http://us.archive.ubuntu.com precise-updates Release Get:7 http://public-repo-1.hortonworks.com HDP-UTILS/main amd64 Packages [1,955 B] Get:8 http://security.ubuntu.com precise-security/main Sources [108 kb] Get:9 http://public-repo-1.hortonworks.com HDP-UTILS/main i386 Packages [762 B]... Reading package lists... Done You must add the Hadoop repository to all hosts in your HP Vertica cluster. Installing the Required Hadoop Packages After configuring the repository, you are ready to install the Hadoop packages. The packages you need to install are: hadoop hadoop-hdfs hadoop-client The names of the packages are usually the same across all Hadoop and Linux distributions.these packages often have additional dependencies. Always accept any additional packages that the Linux package manager asks to install. To install these packages, use the package manager command for your Linux distribution. The package manager command you need to use depends on your Linux distribution: On Red Hat and CentOS, the package manager command is yum. On Debian and Ubuntu, the package manager command is apt-get. HP Vertica Analytic Database (7.1.x) Page 98 of 123

Using the HP Vertica Storage Location for HDFS On SUSE the package manager command is zypper. Consult your Linux distribution's documentation for instructions on installing packages. The following example demonstrates installing the required Hadoop packages from the Hortonworks 2.1 distribution on an Ubuntu 12.04 system. # apt-get install hadoop hadoop-hdfs hadoop-client Reading package lists... Done Building dependency tree Reading state information... Done The following extra packages will be installed: bigtop-jsvc hadoop-mapreduce hadoop-yarn zookeeper The following NEW packages will be installed: bigtop-jsvc hadoop hadoop-client hadoop-hdfs hadoop-mapreduce hadoop-yarn zookeeper 0 upgraded, 7 newly installed, 0 to remove and 90 not upgraded. Need to get 86.6 MB of archives. After this operation, 99.8 MB of additional disk space will be used. Do you want to continue [Y/n]? Y Get:1 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main bigtop-jsvc amd64 1.0.10-1 [28.5 kb] Get:2 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main zookeeper all 3.4.5.2.1.3.0-563 [6,820 kb] Get:3 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main hadoop all 2.4.0.2.1.3.0-563 [21.5 MB] Get:4 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main hadoop-hdfs all 2.4.0.2.1.3.0-563 [16.0 MB] Get:5 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main hadoop-yarn all 2.4.0.2.1.3.0-563 [15.1 MB] Get:6 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main hadoop-mapreduce all 2.4.0.2.1.3.0-563 [27.2 MB] Get:7 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main hadoop-client all 2.4.0.2.1.3.0-563 [3,650 B] Fetched 86.6 MB in 1min 2s (1,396 kb/s) Selecting previously unselected package bigtop-jsvc. (Reading database... 197894 files and directories currently installed.) Unpacking bigtop-jsvc (from.../bigtop-jsvc_1.0.10-1_amd64.deb)... Selecting previously unselected package zookeeper. Unpacking zookeeper (from.../zookeeper_3.4.5.2.1.3.0-563_all.deb)... Selecting previously unselected package hadoop. Unpacking hadoop (from.../hadoop_2.4.0.2.1.3.0-563_all.deb)... Selecting previously unselected package hadoop-hdfs. Unpacking hadoop-hdfs (from.../hadoop-hdfs_2.4.0.2.1.3.0-563_all.deb)... Selecting previously unselected package hadoop-yarn. Unpacking hadoop-yarn (from.../hadoop-yarn_2.4.0.2.1.3.0-563_all.deb)... Selecting previously unselected package hadoop-mapreduce. Unpacking hadoop-mapreduce (from.../hadoop-mapreduce_2.4.0.2.1.3.0-563_all.deb)... Selecting previously unselected package hadoop-client. Unpacking hadoop-client (from.../hadoop-client_2.4.0.2.1.3.0-563_all.deb)... Processing triggers for man-db... Setting up bigtop-jsvc (1.0.10-1)... Setting up zookeeper (3.4.5.2.1.3.0-563)... update-alternatives: using /etc/zookeeper/conf.dist to provide /etc/zookeeper/conf (zookeeper-conf) in auto mode. Setting up hadoop (2.4.0.2.1.3.0-563)... HP Vertica Analytic Database (7.1.x) Page 99 of 123

Using the HP Vertica Storage Location for HDFS update-alternatives: using /etc/hadoop/conf.empty to provide /etc/hadoop/conf (hadoopconf) in auto mode. Setting up hadoop-hdfs (2.4.0.2.1.3.0-563)... Setting up hadoop-yarn (2.4.0.2.1.3.0-563)... Setting up hadoop-mapreduce (2.4.0.2.1.3.0-563)... Setting up hadoop-client (2.4.0.2.1.3.0-563)... Processing triggers for libc-bin... ldconfig deferred processing now taking place Setting Configuration Parameters You must set two configuration parameters to enable HP Vertica to restore HDFS data: JavaBinaryForUDx is the path to the Java executable. You may have already set this value to use Java UDxs or the HCatalog Connector. You can find the path for the default Java executable from the Bash command shell using the command: which java HadoopHome is the path where Hadoop is installed on the HP Vertica hosts. This is the directory that contains bin/hadoop (the bin directory containing the Hadoop executable file). The default value for this parameter is /usr. The default value is correct if your Hadoop executable is located at /usr/bin/hadoop. The following example demonstrates setting and then reviewing the values of these parameters. => ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java'; => SELECT get_config_parameter('javabinaryforudx'); get_config_parameter ---------------------- /usr/bin/java (1 row) => ALTER DATABASE mydb SET HadoopHome = '/usr'; => SELECT get_config_parameter('hadoophome'); get_config_parameter ---------------------- /usr (1 row) There are additional parameters you may, optionally, set: HadoopFSReadRetryTimeout and HadoopFSWriteRetryTimeout specify how long to wait before failing. The default value for each is 180 seconds, the Hadoop default. If you are confident that your file system will fail more quickly, you can potentially improve performance by lowering these values. HadoopFSReplication is the number of replicas HDFS makes. By default the Hadoop client HP Vertica Analytic Database (7.1.x) Page 100 of 123

chooses this; HP Vertica uses the same value for all nodes. We recommend against changing this unless directed to.

HadoopFSBlockSizeBytes is the block size to write to HDFS; larger files are divided into blocks of this size. The default is 64MB.

Setting Kerberos Parameters

If your HP Vertica nodes are co-located on HDFS nodes and you are using Kerberos, you must change some Hadoop configuration parameters. These changes are needed in order for restoring from backups to work. In yarn-site.xml on every HP Vertica node, set the following parameters:

Parameter                                                        Value
yarn.resourcemanager.proxy-user-privileges.enabled              true
yarn.resourcemanager.proxyusers.*.groups                        *
yarn.resourcemanager.proxyusers.*.hosts                         *
yarn.resourcemanager.proxyusers.*.users                         *
yarn.timeline-service.http-authentication.proxyusers.*.groups   *
yarn.timeline-service.http-authentication.proxyusers.*.hosts    *
yarn.timeline-service.http-authentication.proxyusers.*.users    *

No changes are needed on HDFS nodes that are not also HP Vertica nodes.

Confirming that distcp Runs

Once the packages are installed on all hosts in your cluster, your database should be able to run the Hadoop distcp command. To test it:

1. Log into any host in your cluster as the database administrator.

2. At the Bash shell, enter the command:

   $ hadoop distcp

3. The command should print a message similar to the following:

usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -async                 Should distcp execution be blocking
 -atomic                Commit all changes or none

Using the HP Vertica Storage Location for HDFS -bandwidth <arg> Specify bandwidth per map in MB -delete Delete from target, files missing in source -f <arg> List of files that need to be copied -filelimit <arg> (Deprecated!) Limit number of files copied to <= n -i Ignore failures during copy -log <arg> Folder on DFS where distcp execution logs are saved -m <arg> Max number of concurrent maps to use for copy -mapredsslconf <arg> Configuration for ssl config file, to use with hftps:// -overwrite Choose to overwrite target files unconditionally, even if they exist. -p <arg> preserve status (rbugpc)(replication, block-size, user, group, permission, checksum-type) -sizelimit <arg> (Deprecated!) Limit number of files copied to <= n bytes -skipcrccheck Whether to skip CRC checks between source and target paths. -strategy <arg> Copy strategy to use. Default is dividing work based on file sizes -tmp <arg> Intermediate work path to be used for atomic commit -update Update target, copying only missingfiles or directories 4. Repeat these steps on the other hosts in your database to ensure all of the hosts can run distcp. Troubleshooting If you cannot run the distcp command, try the following steps: If Bash cannot find the hadoop command, you may need to manually add Hadoop's bin directory to the system search path. An alternative is to create a symbolic link in an existing directory in the search path (such as /usr/bin) to the hadoop binary. Ensure the version of Java installed on your HP Vertica cluster is compatible with your Hadoop distribution. Review the Linux package installation tool's logs for errors. In some cases, packages may not be fully installed, or may not have been downloaded due to network issues. Ensure that the database administrator account has permission to execute the hadoop command. You may need to add the account to a specific group in order to allow it to run the necessary commands. HP Vertica Analytic Database (7.1.x) Page 102 of 123
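If the hadoop command is installed but not on the search path, the first troubleshooting item above suggests creating a symbolic link. The following is a minimal sketch; the source path is hypothetical and depends on where your distribution installed the hadoop binary, so locate it first.

$ # /path/to/hadoop/bin/hadoop is a placeholder; use the location reported by your package manager
$ sudo ln -s /path/to/hadoop/bin/hadoop /usr/bin/hadoop
$ hadoop version    # confirm that the command now resolves for the database administrator account

If the command resolves but still fails, recheck the Java compatibility and package installation items in the list above.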

Configuring Hadoop and HP Vertica to Enable Backup of HDFS Storage

The HP Vertica backup script uses HDFS's snapshotting feature to create a backup of HDFS storage locations. A directory must allow snapshotting before HDFS can take a snapshot. Only a Hadoop superuser can enable snapshotting on a directory. HP Vertica can enable snapshotting automatically if the database administrator is also a Hadoop superuser.

If HDFS is unsecured, the following instructions apply to the database administrator account, usually dbadmin. If HDFS uses Kerberos security, the following instructions apply to the principal stored in the HP Vertica keytab file, usually vertica. The instructions below use the term "database account" to refer to this user.

We recommend that you make the database administrator or principal a Hadoop superuser. If you are not able to do so, you must enable snapshotting on the directory before configuring it for use by HP Vertica.

The steps you need to take to make the HP Vertica database administrator account a superuser depend on the distribution of Hadoop you are using. Consult your Hadoop distribution's documentation for details. Instructions for two distributions are provided here.

Granting Superuser Status on Hortonworks 2.1

To make the database account a Hadoop superuser:

1. Log into your Hadoop cluster's Hortonworks Hue web user interface. If your Hortonworks cluster uses Ambari or you do not have a web-based user interface, see the Hortonworks documentation for information on granting privileges to users.

2. Click the User Admin icon.

3. In the Hue Users page, click the database account's username.

4. Click the Step 3: Advanced tab.

5. Select Superuser status.

Granting Superuser Status on Cloudera 5.1

Cloudera Hadoop treats Linux users that are members of the group named supergroup as superusers. Cloudera Manager does not automatically create this group. Cloudera also does not create a Linux user for each Hadoop user. To create a Linux account for the database account and assign it to supergroup:

1. Log into your Hadoop cluster's NameNode as root.

2. Use the groupadd command to add a group named supergroup.

3. Cloudera does not automatically create a Linux user that corresponds to the database administrator's Hadoop account. If the Linux system does not have a user for your database account, you must create it. Use the adduser command to create this user.

4. Use the usermod command to add the database account to supergroup.

5. Verify that the database account is now a member of supergroup using the groups command.

6. Repeat steps 1 through 5 for any other NameNodes in your Hadoop cluster.

The following example demonstrates following these steps to grant the database administrator superuser status.

# adduser dbadmin
# groupadd supergroup
# usermod -a -G supergroup dbadmin
# groups dbadmin
dbadmin : dbadmin supergroup

Consult the documentation for the Linux distribution installed on your Hadoop cluster for more information on managing users and groups.

Manually Enabling Snapshotting for a Directory

If you cannot grant superuser status to the database account, you can instead enable snapshotting of each directory manually. Use the following command:

hdfs dfsadmin -allowSnapshot path

Issue this command for each directory on each node. Remember to do this each time you add a new node to your HDFS cluster.

Nested snapshottable directories are not allowed, so you cannot enable snapshotting for a parent directory to automatically enable it for child directories. You must enable it for each individual directory.

Additional Requirements for Kerberos

If HDFS uses Kerberos, then in addition to granting the keytab principal access, you must set an HP Vertica configuration parameter. In HP Vertica, set the HadoopConfDir parameter to the location of the directory containing the core-site.xml, hdfs-site.xml, and yarn-site.xml configuration files:

=> ALTER DATABASE exampledb SET HadoopConfDir = '/hadoop';

All three configuration files must be present in this directory. If your HP Vertica nodes are not co-located on HDFS nodes, then you must copy these files from an HDFS node to each HP Vertica node. Use the same path on every database node, because HadoopConfDir is a global value.
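A minimal sketch of distributing the files follows. The host names (vertica01 through vertica03) and the /etc/hadoop/conf source path are assumptions; your distribution may keep its client configuration elsewhere. Run the loop on an HDFS node that has the configuration files, and make sure the /hadoop target directory from the example above exists on every HP Vertica node.

$ # Run on an HDFS node; host names and the source path are examples only
$ for node in vertica01 vertica02 vertica03; do
>   scp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml \
>       /etc/hadoop/conf/yarn-site.xml $node:/hadoop/
> done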

Testing the Database Account's Ability to Make HDFS Directories Snapshottable

After making the database account a Hadoop superuser, you should verify that the account can set directories snapshottable:

1. Log into the Hadoop cluster as the database account (dbadmin by default).

2. Determine a location in HDFS where the database administrator can create a directory. The /tmp directory is usually available. Create a test HDFS directory using the command:

   hdfs dfs -mkdir /path/testdir

3. Make the test directory snapshottable using the command:

   hdfs dfsadmin -allowSnapshot /path/testdir

The following example demonstrates creating an HDFS directory and making it snapshottable:

$ hdfs dfs -mkdir /tmp/snaptest
$ hdfs dfsadmin -allowSnapshot /tmp/snaptest
Allowing snaphot on /tmp/snaptest succeeded

Performing Backups Containing HDFS Storage Locations

After you configure Hadoop and HP Vertica, HDFS storage locations are automatically backed up when you perform a full database backup. If you already have a backup configuration file for a full database backup, you do not need to make any changes to it. You just run the vbr.py backup script as usual to perform the full database backup. See Creating Full and Incremental Backups in the Administrator's Guide for instructions on running the vbr.py backup script.

If you do not have a backup configuration file for a full database backup, you must create one to back up the data in your HDFS storage locations. See Creating vbr.py Configuration Files in the Administrator's Guide for more information.
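For example, a full backup that includes the HDFS storage locations is run the same way as any other full backup; only the configuration file name below is a placeholder for your own file.

$ # full_backup.ini is a placeholder for your full-backup configuration file
$ vbr.py --task backup --config-file full_backup.ini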

Removing HDFS Storage Locations

The steps to remove an HDFS storage location are similar to standard storage locations:

1. Remove any existing data from the HDFS storage location.

2. Change the location's usage to TEMP.

3. Retire the location on each host that has the storage location defined by using RETIRE_LOCATION. You can use the enforce_storage_move parameter to make the change immediately, or wait for the Tuple Mover to perform its next moveout.

4. Drop the location on each host that has the storage location defined by using DROP_LOCATION.

5. Optionally remove the snapshots and files from the HDFS directory for the storage location.

The following sections explain each of these steps in detail.

Important: If you have backed up the data in the HDFS storage location you are removing, you must perform a full database backup after you remove the location. If you do not, and you restore the database to a backup made before you removed the location, the location's data is restored.

Removing Existing Data from an HDFS Storage Location

You cannot drop a storage location that contains data or is used by any storage policy. You have several options to remove data and storage policies:

Drop all of the objects (tables or schemata) that store data in the location. This is the simplest option. However, you can only use this method if you no longer need the data stored in the HDFS storage location.

Change the storage policies of objects stored on HDFS to another storage location. When you alter the storage policy, you force all of the data in the HDFS location to move to the new location. This option requires that you have an alternate storage location available.

Clear the storage policies of all objects that store data on the storage location. You then move the location's data through a process of retiring it.

The following sections explain the last two options in greater detail.

Moving Data to Another Storage Location

You can move data off of an HDFS storage location by altering the storage policies of the objects that use the location. Use the SET_OBJECT_STORAGE_POLICY function to change each object's storage location. If you set this function's third argument to true, it moves the data off of the storage location before returning.

The following example demonstrates moving the table named test from the hdfs2 storage location to another location named ssd.

=> SELECT node_name, projection_name, location_label, total_row_count
   FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%';
 node_name   projection_name   location_label   total_row_count

Using the HP Vertica Storage Location for HDFS ------------------+-----------------+----------------+----------------- v_vmart_node0001 test_b1 hdfs2 333631 v_vmart_node0001 test_b0 hdfs2 332233 v_vmart_node0001 test_b0 hdfs2 332233 v_vmart_node0001 test_b1 hdfs2 333631 v_vmart_node0003 test_b1 hdfs2 334136 v_vmart_node0003 test_b0 hdfs2 333631 v_vmart_node0003 test_b0 hdfs2 333631 v_vmart_node0003 test_b1 hdfs2 334136 v_vmart_node0002 test_b1 hdfs2 332233 v_vmart_node0002 test_b0 hdfs2 334136 v_vmart_node0002 test_b0 hdfs2 334136 v_vmart_node0002 test_b1 hdfs2 332233 (12 rows) => select set_object_storage_policy('test','ssd', true); set_object_storage_policy -------------------------------------------------- Object storage policy set. Task: moving storages (Table: public.test) (Projection: public.test_b0) (Table: public.test) (Projection: public.test_b1) (1 row) => SELECT node_name, projection_name, location_label, total_row_count FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%'; node_name projection_name location_label total_row_count ------------------+-----------------+----------------+----------------- v_vmart_node0001 test_b0 ssd 332233 v_vmart_node0001 test_b0 ssd 332233 v_vmart_node0001 test_b1 ssd 333631 v_vmart_node0001 test_b1 ssd 333631 v_vmart_node0002 test_b0 ssd 334136 v_vmart_node0002 test_b0 ssd 334136 v_vmart_node0002 test_b1 ssd 332233 v_vmart_node0002 test_b1 ssd 332233 v_vmart_node0003 test_b0 ssd 333631 v_vmart_node0003 test_b0 ssd 333631 v_vmart_node0003 test_b1 ssd 334136 v_vmart_node0003 test_b1 ssd 334136 (12 rows) Once you have moved all of the data in the storage location, you are ready to proceed to the next step of removing the storage location. Clearing Storage Policies Another option to move data off of a storage location is to clear the storage policy of each object storing data in the location. You clear an object's storage policy using the CLEAR_OBJECT_ STORAGE_POLICY function. Once you clear the storage policy, the Tuple Mover eventually migrates the object's data from the storage location to the database's default storage location. The TM moves the data when it performs a move storage operation. This operation runs infrequently at low priority. Therefore, it may be some time before the data migrates out of the storage location. You can speed up the data migration process by: HP Vertica Analytic Database (7.1.x) Page 107 of 123

Using the HP Vertica Storage Location for HDFS 1. Calling the RETIRE_LOCATION function to retire the storage location on each host that defines it. 2. Calling the MOVE_RETIRED_LOCATION_DATA function to move the location's data to the database's default storage location. 3. Calling the RESTORE_LOCATION function to restore the location on each host that defines it. You must perform this step because you cannot drop retired storage locations. The following example demonstrates clearing the object storage policy of a table stored on HDFS, then performing the steps to move the data off of the location. => SELECT * FROM storage_policies; schema_name object_name policy_details location_label -------------+-------------+----------------+---------------- public test Table hdfs2 (1 row) => SELECT clear_object_storage_policy('test'); clear_object_storage_policy -------------------------------- Object storage policy cleared. (1 row) => SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001'); retire_location --------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 retired. (1 row) => SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002', 'v_vmart_node0002'); retire_location --------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 retired. (1 row) => SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003', 'v_vmart_node0003'); retire_location --------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 retired. (1 row) => SELECT node_name, projection_name, location_label, total_row_count FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%'; node_name projection_name location_label total_row_count ------------------+-----------------+----------------+----------------- v_vmart_node0001 test_b1 hdfs2 333631 v_vmart_node0001 test_b0 hdfs2 332233 v_vmart_node0002 test_b1 hdfs2 332233 v_vmart_node0002 test_b0 hdfs2 334136 v_vmart_node0003 test_b1 hdfs2 334136 v_vmart_node0003 test_b0 hdfs2 333631 HP Vertica Analytic Database (7.1.x) Page 108 of 123

Using the HP Vertica Storage Location for HDFS (6 rows) => SELECT move_retired_location_data(); move_retired_location_data ----------------------------------------------- Move data off retired storage locations done (1 row) => SELECT node_name, projection_name, location_label, total_row_count FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%'; node_name projection_name location_label total_row_count ------------------+-----------------+----------------+----------------- v_vmart_node0001 test_b0 332233 v_vmart_node0001 test_b1 333631 v_vmart_node0002 test_b0 334136 v_vmart_node0002 test_b1 332233 v_vmart_node0003 test_b0 333631 v_vmart_node0003 test_b1 334136 (6 rows) => SELECT restore_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001'); restore_location ---------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 restored. (1 row) => SELECT restore_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002', 'v_vmart_node0002'); restore_location ---------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 restored. (1 row) => SELECT restore_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003', 'v_vmart_node0003'); restore_location ---------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 restored. (1 row) Changing the Usage of HDFS Storage Locations You cannot drop a storage location that allows the storage of data files (ROS containers). Before you can drop an HDFS storage location, you must change its usage from DATA to TEMP using the ALTER_LOCATION_USE function. Make this change on every host in the cluster that defines the storage location. Important: HP recommends that you do not use HDFS storage locations for temporary file storage. Only set HDFS storage locations to allow temporary file storage as part of the removal process. HP Vertica Analytic Database (7.1.x) Page 109 of 123

The following example demonstrates using the ALTER_LOCATION_USE function to change the HDFS storage location to temporary file storage. The example calls the function three times: once for each node in the cluster that defines the location.

=> SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001',
     'v_vmart_node0001','temp');
                          ALTER_LOCATION_USE
---------------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 usage changed.
(1 row)

=> SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002',
     'v_vmart_node0002','temp');
                          ALTER_LOCATION_USE
---------------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 usage changed.
(1 row)

=> SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003',
     'v_vmart_node0003','temp');
                          ALTER_LOCATION_USE
---------------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 usage changed.
(1 row)

Dropping an HDFS Storage Location

After removing all data and changing the data usage of an HDFS storage location, you can drop it. Use the DROP_LOCATION function to drop the storage location from each host that defines it. The following example demonstrates dropping an HDFS storage location from a three-node HP Vertica database.

=> SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001',
     'v_vmart_node0001');
                         DROP_LOCATION
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 dropped.
(1 row)

=> SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002',
     'v_vmart_node0002');
                         DROP_LOCATION
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 dropped.
(1 row)

=> SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003',
     'v_vmart_node0003');
                         DROP_LOCATION
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 dropped.
(1 row)
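After running DROP_LOCATION on every node, you can confirm that no node still defines the location. The following sketch is illustrative only: it assumes the STORAGE_LOCATIONS system table and the webhdfs paths used in the preceding examples, and it runs the query with vsql from the shell.

$ vsql -c "SELECT node_name, location_path, location_label
           FROM storage_locations
           WHERE location_path LIKE 'webhdfs://%';"

If the drop succeeded on every node, no webhdfs paths remain in the result.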

Removing Storage Location Files from HDFS

Dropping an HDFS storage location does not automatically clean the HDFS directory that stored the location's files. Any snapshots of the data files created when backing up the location are also not deleted. These files consume disk space on HDFS and also prevent the directory from being reused as an HDFS storage location. HP Vertica refuses to create a storage location in a directory that contains existing files or subdirectories.

You must log into the Hadoop cluster to delete the files from HDFS. An alternative is to use another HDFS file management tool.

Removing Backup Snapshots

HDFS returns an error if you attempt to remove a directory that has snapshots:

$ hdfs dfs -rm -r -f -skipTrash /user/dbadmin/v_vmart_node0001
rm: The directory /user/dbadmin/v_vmart_node0001 cannot be deleted since
/user/dbadmin/v_vmart_node0001 is snapshottable and already has snapshots

The HP Vertica backup script creates snapshots of HDFS storage locations as part of the backup process. See Backing Up HDFS Storage Locations for more information. If you made backups of your HDFS storage location, you must delete the snapshots before removing the directories.

HDFS stores snapshots in a subdirectory named .snapshot. You can list the snapshots in a directory using the standard HDFS ls command. The following example lists the snapshots defined for node0001.

$ hdfs dfs -ls /user/dbadmin/v_vmart_node0001/.snapshot
Found 1 items
drwxrwx---   - dbadmin supergroup          0 2014-09-02 10:13 /user/dbadmin/v_vmart_node0001/.snapshot/s20140902-101358.629

To remove a snapshot, use the command:

hdfs dfs -deleteSnapshot directory snapshotname

The following example deletes the snapshot shown in the previous example:

$ hdfs dfs -deleteSnapshot /user/dbadmin/v_vmart_node0001 s20140902-101358.629

You must delete each snapshot from the directory for each host in the cluster. After you have deleted the snapshots, you can delete the directories in the storage location.

Important: Each snapshot's name is based on a timestamp down to the millisecond. Nodes create their snapshots independently; because they do not synchronize snapshot creation, their snapshot names differ. You must list each node's snapshot directory to learn the names of the snapshots it contains.

See Apache's HDFS Snapshot documentation for more information about managing and removing snapshots.
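As an illustration of clearing snapshots from every node directory in one pass, the following shell sketch loops over the directories used in the preceding examples and deletes each snapshot it finds. It is a minimal sketch: the /user/dbadmin/v_vmart_node000N paths are carried over from the examples above, and it assumes the account running it is allowed to delete snapshots in those directories.

for dir in /user/dbadmin/v_vmart_node0001 \
           /user/dbadmin/v_vmart_node0002 \
           /user/dbadmin/v_vmart_node0003
do
    # List the full path of each snapshot in this node's .snapshot directory ...
    for snap in $(hdfs dfs -ls "$dir/.snapshot" | grep "$dir/.snapshot/" | awk '{print $NF}')
    do
        # ... and delete it by directory and snapshot name.
        hdfs dfs -deleteSnapshot "$dir" "$(basename "$snap")"
    done
done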

Removing the Storage Location Directories

You can remove the directories that held the storage location's data by either of the following methods:

- Use an HDFS file manager to delete the directories. See your Hadoop distribution's documentation to determine whether it provides a file manager.

- Log into the Hadoop NameNode using the database administrator's account and use HDFS's rmr command to delete the directories. See Apache's File System Shell Guide for more information.

The following example uses the HDFS rmr command from the Linux command line to delete the directories left behind in the HDFS storage location directory /user/dbadmin. It uses the -skipTrash flag to force the immediate deletion of the files.

$ hdfs dfs -ls /user/dbadmin
Found 3 items
drwxrwx---   - dbadmin supergroup          0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0001
drwxrwx---   - dbadmin supergroup          0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0002
drwxrwx---   - dbadmin supergroup          0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0003

$ hdfs dfs -rmr -skipTrash /user/dbadmin/*
Deleted /user/dbadmin/v_vmart_node0001
Deleted /user/dbadmin/v_vmart_node0002
Deleted /user/dbadmin/v_vmart_node0003

Troubleshooting HDFS Storage Locations

This topic explains some common issues with HDFS storage locations.

HDFS Storage Disk Consumption

By default, HDFS makes three copies of each file it stores. This replication helps prevent data loss due to disk or system failure. It also helps increase performance by allowing several nodes to handle a request for a file.

An HP Vertica database with a K-safety value of 1 or greater also stores its data redundantly using buddy projections. When a K-safe HP Vertica database stores data in an HDFS storage location, its data redundancy is compounded by HDFS's redundancy: HDFS stores three copies of the primary projection's data, plus three copies of the buddy projection, for a total of six copies of the data.

If you want to reduce the amount of disk storage used by HDFS locations, you can alter the number of copies of data that HDFS stores. The HP Vertica configuration parameter named HadoopFSReplication controls the number of copies of data HDFS stores.
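Before changing the parameter, you may want to confirm the replication factor HDFS currently applies to files under the storage location. One way, sketched below with the directory name used earlier (and assuming an account that is permitted to run fsck on that path), is hdfs fsck, which prints a repl= value for each file it reports:

$ hdfs fsck /user/dbadmin/v_vmart_node0001 -files -blocks | head -n 20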

You can determine the current HDFS disk usage by logging into the Hadoop NameNode and issuing the command:

hdfs dfsadmin -report

This command prints the usage for the entire HDFS storage, followed by details for each node in the Hadoop cluster. The following example shows the beginning of the output from this command:

$ hdfs dfsadmin -report
Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32087212032 (29.88 GB)
DFS Remaining: 31565144064 (29.40 GB)
DFS Used: 522067968 (497.88 MB)
DFS Used%: 1.63%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
...

After loading a million rows into a table stored in an HDFS storage location, the report shows greater disk usage:

Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32085299338 (29.88 GB)
DFS Remaining: 31373565952 (29.22 GB)
DFS Used: 711733386 (678.76 MB)
DFS Used%: 2.22%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
...

The following HP Vertica example demonstrates:

1. Dropping the table in HP Vertica.

2. Setting the HadoopFSReplication configuration parameter to 1. This tells HDFS to store a single copy of an HDFS storage location's data.

3. Recreating the table and reloading its data.

=> DROP TABLE messages;
DROP TABLE
=> ALTER DATABASE mydb SET HadoopFSReplication = 1;
=> CREATE TABLE messages (id INTEGER, text VARCHAR);
CREATE TABLE
=> SELECT SET_OBJECT_STORAGE_POLICY('messages', 'hdfs');
 SET_OBJECT_STORAGE_POLICY
----------------------------
 Object storage policy set.
(1 row)

=> COPY messages FROM '/home/dbadmin/messages.txt' DIRECT;
 Rows Loaded
-------------
     1000000

Running the HDFS report on Hadoop now shows less disk space in use:

$ hdfs dfsadmin -report
Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32086278190 (29.88 GB)
DFS Remaining: 31500988416 (29.34 GB)
DFS Used: 585289774 (558.18 MB)
DFS Used%: 1.82%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
...

Caution: Reducing the number of copies of data stored by HDFS increases the risk of data loss. It can also negatively impact the performance of HDFS by reducing the number of nodes that can provide access to a file. This slower performance can affect HP Vertica queries that involve data stored in an HDFS storage location.

Kerberos Authentication When Creating a Storage Location

If HDFS uses Kerberos authentication, the CREATE LOCATION statement authenticates using the HP Vertica keytab principal, not the principal of the user performing the action. If the creation fails with an authentication error, verify that you have followed the steps described in Configuring Kerberos to configure this principal.

When creating an HDFS storage location on a Hadoop cluster that uses Kerberos, CREATE LOCATION reports the principal being used, as in the following example:

=> CREATE LOCATION 'webhdfs://hadoop.example.com:50070/user/dbadmin' ALL NODES SHARED
     USAGE 'data' LABEL 'coldstorage';
NOTICE 0: Performing HDFS operations using kerberos principal [vertica/hadoop.example.com]
CREATE LOCATION
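If CREATE LOCATION reports an authentication failure instead, one quick check is to confirm which principals the keytab file on each node actually contains. The sketch below assumes a keytab at /etc/vertica.keytab; substitute the path you set in KerberosKeytabFile.

$ klist -kt /etc/vertica.keytab

The principal listed should match the one reported in the CREATE LOCATION notice; if it does not, revisit the keytab configuration described in Configuring Kerberos.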

Backup or Restore Fails When Using Kerberos

When backing up an HDFS storage location that uses Kerberos, you might see an error such as:

createSnapshot: Failed on local exception: java.io.IOException:
java.lang.IllegalArgumentException: Server has invalid Kerberos principal:
hdfs/test.example.com@EXAMPLE.COM;

When restoring an HDFS storage location that uses Kerberos, you might see an error such as:

Error msg: Initialization thread logged exception: Distcp failure!

Either of these failures means that HP Vertica could not find the required configuration files in the HadoopConfDir directory. Usually this happens because you have set the parameter but have not copied the files from an HDFS node to your HP Vertica nodes. See "Additional Requirements for Kerberos" in Configuring Hadoop and HP Vertica to Enable Backup of HDFS Storage.
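A minimal recovery sketch, under the assumption that the parameter in question is HadoopConfDir as named above and that the Hadoop client configuration files live in /etc/hadoop/conf on the Hadoop node: copy the files to the same directory on every HP Vertica node, then point the parameter at that directory. The host name, database name, and paths here are placeholders, not values from this guide.

$ # On each HP Vertica node, copy the Hadoop client configuration files
$ # (for example core-site.xml and hdfs-site.xml) from an HDFS node.
$ scp "hadoop.example.com:/etc/hadoop/conf/*-site.xml" /etc/hadoop/conf/

$ # Tell HP Vertica where to find the copied files.
$ vsql -c "ALTER DATABASE mydb SET HadoopConfDir = '/etc/hadoop/conf';"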

Integrating HP Vertica with the MapR Distribution of Hadoop

MapR is a distribution of Apache Hadoop produced by MapR Technologies that extends the standard Hadoop components with its own features. By adding HP Vertica to a MapR cluster, you can benefit from the advantages of both HP Vertica and Hadoop. To learn more about integrating HP Vertica and MapR, see Configuring HP Vertica Analytics Platform with MapR, which appears on the MapR website.

Using Kerberos with Hadoop

If your Hadoop cluster uses Kerberos authentication to restrict access to HDFS, you must configure HP Vertica to make authenticated connections. The details of this configuration vary based on which methods you are using to access HDFS data:

- How Vertica Uses Kerberos with Hadoop
- Configuring Kerberos

How Vertica Uses Kerberos with Hadoop

HP Vertica authenticates with Hadoop in two ways that require different configurations:

- User authentication: on behalf of the user, by passing along the user's existing Kerberos credentials, as occurs with the HDFS Connector and the HCatalog Connector.

- HP Vertica authentication: on behalf of system processes (such as the Tuple Mover), by using a special Kerberos credential stored in a keytab file.

User Authentication

To use HP Vertica with Kerberos and Hadoop, the client user first authenticates with the Kerberos server (Key Distribution Center, or KDC) being used by the Hadoop cluster. A user might run kinit or sign in to Active Directory, for example.

A user who authenticates to a Kerberos server receives a Kerberos ticket. At the beginning of a client session, HP Vertica automatically retrieves this ticket. The database then uses the ticket to get a Hadoop token, which Hadoop uses to grant access. HP Vertica uses this token to access HDFS, such as when executing a query on behalf of the user. When the token expires, the database automatically renews it, also renewing the Kerberos ticket if necessary.

The following figure shows how the user, HP Vertica, Hadoop, and Kerberos interact in user authentication:
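As a concrete illustration of the client-side portion of this flow, a session might begin as in the sketch below before HP Vertica takes over the token exchange shown in the figure. The principal, realm, host, and database names are placeholders, not values from this guide.

$ # Obtain a Kerberos ticket from the KDC used by the Hadoop cluster.
$ kinit alice@EXAMPLE.COM

$ # Confirm that the ticket was granted.
$ klist

$ # Connect to HP Vertica; the database picks up the ticket for this session
$ # and exchanges it for a Hadoop token when HDFS access is needed.
$ vsql -h vertica01.example.com -U alice -d mydb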

When using the HDFS Connector or the HCatalog Connector, or when reading an ORC file stored in HDFS, HP Vertica uses the client identity, as the preceding figure shows.

HP Vertica Authentication

Automatic processes, such as the Tuple Mover, do not log in the way users do. Instead, HP Vertica uses a special identity (principal) stored in a keytab file on every database node. (This approach is also used for HP Vertica clusters that use Kerberos but do not use Hadoop.) After you configure the keytab file, HP Vertica uses the principal residing there to automatically obtain and maintain a Kerberos ticket, much as in the client scenario. In this case, the client does not interact with Kerberos.

The following figure shows the interactions required for HP Vertica authentication:

Each HP Vertica node uses its own principal; it is common to incorporate the name of the node into the principal name. You can either create one keytab file per node, containing only that node's principal, or create a single keytab file containing all the principals and distribute it to all nodes. Either way, a node uses its principal to get a Kerberos ticket and then uses that ticket to get a Hadoop token.

For simplicity, the preceding figure shows the full set of interactions for only one database node.

When creating HDFS storage locations, HP Vertica uses the principal in the keytab file, not the principal of the user issuing the CREATE LOCATION statement.

See Also

For specific configuration instructions, see Configuring Kerberos.

Configuring Kerberos

HP Vertica can connect with Hadoop in several ways, and how you manage Kerberos authentication varies by connection type. This documentation assumes that you are using Kerberos for both your HDFS and HP Vertica clusters.

Prerequisite: Setting Up Users and the Keytab File

If you have not already configured Kerberos authentication for HP Vertica, follow the instructions in Configure HP Vertica for Kerberos Authentication. In particular:

- Create one Kerberos principal per node.

- Place the keytab file(s) in the same location on each database node and set that location in KerberosKeytabFile (see Specify the Location of the Keytab File).

- Set KerberosServiceName to the name of the principal (see Inform HP Vertica About the Kerberos Principal).

HCatalog Connector

You use the HCatalog Connector to query data in Hive. Queries are executed on behalf of HP Vertica users. If the current user has a Kerberos key, HP Vertica passes it to the HCatalog Connector automatically. Verify that all users who need access to Hive have been granted access to HDFS.

In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that you enable all Hadoop components to impersonate the HP Vertica user. The easiest way to do this is to set the proxyuser property using wildcards for all users on all hosts and in all groups. Consult your Hadoop documentation for instructions. Make sure you do this before running hcatUtil (see Configuring HP Vertica for HCatalog).

HDFS Connector

The HDFS Connector loads data from HDFS into HP Vertica on behalf of the user, using a User Defined Source. If the user performing the data load has a Kerberos key, the UDS uses it to access HDFS. Verify that all users who use this connector have been granted access to HDFS.

HDFS Storage Location

You can create a database storage location in HDFS. An HDFS storage location provides improved performance compared to other HDFS interfaces (such as the HCatalog Connector). After you create Kerberos principals for each node, give all of them read and write permissions to the HDFS directory you will use as a storage location.

If you plan to back up HDFS storage locations, take the following additional steps:

- Grant Hadoop superuser privileges to the new principals.

- Configure backups, including setting the HadoopConfDir configuration parameter, following the instructions in Configuring Hadoop and HP Vertica to Enable Backup of HDFS Storage.

- Configure user impersonation so that you can restore from backups, following the instructions in "Setting Kerberos Parameters" in Configuring HP Vertica to Restore HDFS Storage Locations.

Because the keytab file supplies the principal used to create the location, you must have it in place before creating the storage location. After you deploy keytab files to all database nodes, use the CREATE LOCATION statement to create the storage location as usual.

Token Expiration

HP Vertica attempts to refresh Hadoop tokens automatically before they expire, but you can also set a minimum refresh frequency if you prefer. The HadoopFSTokenRefreshFrequency configuration parameter specifies the frequency in seconds:

=> ALTER DATABASE exampledb SET HadoopFSTokenRefreshFrequency = '86400';

If the current age of the token is greater than the value specified in this parameter, HP Vertica refreshes the token before accessing data stored in HDFS.

See Also

How Vertica Uses Kerberos with Hadoop
Troubleshooting Kerberos Authentication


We appreciate your feedback! If you have comments about this document, send them to vertica-docfeedback@hp.com with the subject line "Feedback on Hadoop Integration Guide (Vertica Analytic Database 7.1.x)".