HPE Vertica Analytic Database Software Version: 7.2.x Document Release Date: 5/18/2016
Legal Notices Warranty The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HPE shall not be liable for technical or editorial errors or omissions contained herein. The information contained herein is subject to change without notice. Restricted Rights Legend Confidential computer software. Valid license from HPE required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. Copyright Notice Copyright 2006-2016 Hewlett Packard Enterprise Development LP Trademark Notices Adobe is a trademark of Adobe Systems Incorporated. Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation. UNIX is a registered trademark of The Open Group. This product includes an interface of the 'zlib' general purpose compression library, which is Copyright 1995-2002 Jean-loup Gailly and Mark Adler. HPE Vertica Analytic Database (7.2.x) Page 2 of 145
Contents Introduction to Hadoop Integration 8 Hadoop Distributions 8 Integration Options 8 File Paths 9 Cluster Layout 10 Co-Located Clusters 10 Hardware Recommendations 11 Configuring Hadoop for Co-Located Clusters 12 webhdfs 12 YARN 12 Hadoop Balancer 13 Replication Factor 13 Disk Space for Non-HDFS Use 13 Separate Clusters 14 Choosing Which Hadoop Interface to Use 16 Creating an HDFS Storage Location 16 Reading ORC and Parquet Files 16 Using the HCatalog Connector 16 Using the HDFS Connector 17 Using the MapReduce Connector 17 Using Kerberos with Hadoop 18 How Vertica uses Kerberos With Hadoop 18 User Authentication 18 Vertica Authentication 19 See Also 20 Configuring Kerberos 21 Prerequisite: Setting Up Users and the Keytab File 21 HCatalog Connector 21 HDFS Connector 21 HDFS Storage Location 22 Token Expiration 22 See Also 23 Reading Native Hadoop File Formats 24 HPE Vertica Analytic Database (7.2.x) Page 3 of 145
Requirements 24 Creating External Tables 24 Loading Data 25 Supported Data Types 26 Kerberos Authentication 26 Examples 26 See Alsos 26 Query Performance 27 Considerations When Writing Files 27 Predicate Pushdown 27 Data Locality 28 Configuring hdfs:/// Access 28 Troubleshooting Reads from Native File Formats 29 webhdfs Error When Using hdfs URIs 29 Reads from Parquet Files Report Unexpected Data-Type Mismatches 29 Time Zones in Timestamp Values Are Not Correct 30 Some Date and Timestamp Values Are Wrong by Several Days 30 Error 7087: Wrong Number of Columns 30 Using the HCatalog Connector 31 Hive, HCatalog, and WebHCat Overview 31 HCatalog Connection Features 32 HCatalog Connection Considerations 32 How the HCatalog Connector Works 33 HCatalog Connector Requirements 34 Vertica Requirements 34 Hadoop Requirements 34 Testing Connectivity 34 Installing the Java Runtime on Your Vertica Cluster 35 Installing a Java Runtime 36 Setting the JavaBinaryForUDx Configuration Parameter 37 Configuring Vertica for HCatalog 38 Copy Hadoop Libraries and Configuration Files 38 Install the HCatalog Connector 40 Upgrading to a New Version of Vertica 41 Additional Options for Native File Formats 41 Using the HCatalog Connector with HA NameNode 42 Defining a Schema Using the HCatalog Connector 42 Querying Hive Tables Using HCatalog Connector 44 Viewing Hive Schema and Table Metadata 44 HPE Vertica Analytic Database (7.2.x) Page 4 of 145
Synchronizing an HCatalog Schema or Table With a Local Schema or Table 49 Examples 50 Data Type Conversions from Hive to Vertica 51 Data-Width Handling Differences Between Hive and Vertica 53 Using Non-Standard SerDes 53 Determining Which SerDe You Need 54 Installing the SerDe on the Vertica Cluster 55 Troubleshooting HCatalog Connector Problems 55 Connection Errors 55 UDx Failure When Querying Data: Error 3399 56 SerDe Errors 58 Differing Results Between Hive and Vertica Queries 58 Preventing Excessive Query Delays 59 Using the HDFS Connector 60 HDFS Connector Requirements 60 Uninstall Prior Versions of the HDFS Connector 60 webhdfs Requirements 61 Kerberos Authentication Requirements 61 Testing Your Hadoop webhdfs Configuration 61 Loading Data Using the HDFS Connector 64 The HDFS File URL 65 Copying Files in Parallel 66 Viewing Rejected Rows and Exceptions 67 Creating an External Table with an HDFS Source 67 Load Errors in External Tables 69 HDFS ConnectorTroubleshooting Tips 69 User Unable to Connect to Kerberos-Authenticated Hadoop Cluster 70 Resolving Error 5118 71 Transfer Rate Errors 71 Error Loading Many Files 72 Using HDFS Storage Locations 73 Storage Location for HDFS Requirements 73 HDFS Space Requirements 74 Additional Requirements for Backing Up Data Stored on HDFS 74 How the HDFS Storage Location Stores Data 75 What You Can Store on HDFS 75 What HDFS Storage Locations Cannot Do 76 Creating an HDFS Storage Location 76 Creating a Storage Location Using Vertica for SQL on Hadoop 78 Adding HDFS Storage Locations to New Nodes 78 HPE Vertica Analytic Database (7.2.x) Page 5 of 145
Creating a Storage Policy for HDFS Storage Locations 79 Storing an Entire Table in an HDFS Storage Location 79 Storing Table Partitions in HDFS 80 Moving Partitions to a Table Stored on HDFS 82 Backing Up Vertica Storage Locations for HDFS 83 Configuring Vertica to Restore HDFS Storage Locations 84 Configuration Overview 85 Installing a Java Runtime 85 Finding Your Hadoop Distribution's Package Repository 86 Configuring Vertica Nodes to Access the Hadoop Distribution s Package Repository 86 Installing the Required Hadoop Packages 88 Setting Configuration Parameters 89 Setting Kerberos Parameters 91 Confirming that distcp Runs 91 Troubleshooting 92 Configuring Hadoop and Vertica to Enable Backup of HDFS Storage 93 Granting Superuser Status on Hortonworks 2.1 93 Granting Superuser Status on Cloudera 5.1 94 Manually Enabling Snapshotting for a Directory 94 Additional Requirements for Kerberos 95 Testing the Database Account's Ability to Make HDFS Directories Snapshottable 95 Performing Backups Containing HDFS Storage Locations 96 Removing HDFS Storage Locations 96 Removing Existing Data from an HDFS Storage Location 97 Moving Data to Another Storage Location 97 Clearing Storage Policies 98 Changing the Usage of HDFS Storage Locations 100 Dropping an HDFS Storage Location 101 Removing Storage Location Files from HDFS 102 Removing Backup Snapshots 102 Removing the Storage Location Directories 103 Troubleshooting HDFS Storage Locations 103 HDFS Storage Disk Consumption 103 Kerberos Authentication When Creating a Storage Location 106 Backup or Restore Fails When Using Kerberos 106 Using the MapReduce Connector 107 MapReduce Connector Features 107 Prerequisites 108 Hadoop and Vertica Cluster Scaling 108 Installing the Connector 108 Accessing Vertica Data From Hadoop 111 Selecting VerticaInputFormat 111 Setting the Query to Retrieve Data From Vertica 112 HPE Vertica Analytic Database (7.2.x) Page 6 of 145
Using a Simple Query to Extract Data From Vertica 112 Using a Parameterized Query and Parameter Lists 113 Using a Discrete List of Values 113 Using a Collection Object 113 Scaling Parameter Lists for the Hadoop Cluster 114 Using a Query to Retrieve Parameter Values for a Parameterized Query 115 Writing a Map Class That Processes Vertica Data 115 Working with the VerticaRecord Class 116 Writing Data to Vertica From Hadoop 117 Configuring Hadoop to Output to Vertica 117 Defining the Output Table 117 Writing the Reduce Class 119 Storing Data in the VerticaRecord 119 Passing Parameters to the Vertica Connector for Hadoop Map Reduce At Run Time 122 Specifying the Location of the Connector.jar File 122 Specifying the Database Connection Parameters 123 Parameters for a Separate Output Database 124 Example Vertica Connector for Hadoop Map Reduce Application 125 Compiling and Running the Example Application 128 Compiling the Example (optional) 130 Running the Example Application 131 Verifying the Results 132 Using Hadoop Streaming with the Vertica Connector for Hadoop Map Reduce 133 Reading Data From Vertica in a Streaming Hadoop Job 133 Writing Data to Vertica in a Streaming Hadoop Job 136 Loading a Text File From HDFS into Vertica 137 Accessing Vertica From Pig 139 Registering the Vertica.jar Files 140 Reading Data From Vertica 140 Writing Data to Vertica 141 Integrating Vertica with the MapR Distribution of Hadoop 143 Send Documentation Feedback 145 HPE Vertica Analytic Database (7.2.x) Page 7 of 145
Introduction to Hadoop Integration

Apache Hadoop, like Vertica, uses a cluster of nodes for distributed processing. The primary component of interest is HDFS, the Hadoop Distributed File System. You can use HDFS from Vertica in several ways:

- You can import HDFS data into locally-stored ROS files.
- You can access HDFS data in place, using external tables.
- You can use HDFS as a storage location for ROS files.

Hadoop includes two other components of interest:

- Hive, a data warehouse that provides the ability to query data stored in Hadoop.
- HCatalog, a component that makes Hive metadata available to applications, such as Vertica, outside of Hadoop.

A Hadoop cluster can use Kerberos authentication to protect data stored in HDFS. Vertica integrates with Kerberos to access HDFS data if needed. See Using Kerberos with Hadoop.

Hadoop Distributions

Vertica can be used with Hadoop distributions from Hortonworks, Cloudera, and MapR. See Vertica Integrations for Hadoop for the specific versions that are supported.

Integration Options

Vertica supports two cluster architectures. Which you use affects the decisions you make about integration.

- You can co-locate Vertica on some or all of your Hadoop nodes. Vertica can then take advantage of local data. This option is supported only for Vertica for SQL on Hadoop.
- You can build a Vertica cluster that is separate from your Hadoop cluster. In this configuration, Vertica can fully use each of its nodes; it does not share resources with Hadoop. This option is not supported for Vertica for SQL on Hadoop.

These layout options are described in Cluster Layout.

Both layouts support several interfaces for using Hadoop:

- An HDFS Storage Location uses HDFS to hold Vertica data (ROS files).
- The HCatalog Connector lets Vertica query data that is stored in a Hive database the same way you query data stored natively in a Vertica schema.
- Vertica can directly query data stored in native Hadoop file formats (ORC and Parquet). This option is faster than using the HCatalog Connector for this type of data. See Reading Native Hadoop File Formats.
- The HDFS Connector lets Vertica import HDFS data. It also lets Vertica read HDFS data as an external table without using Hive.
- The MapReduce Connector lets you create Hadoop MapReduce jobs that retrieve data from Vertica. These jobs can also insert data into Vertica.

File Paths

Hadoop file paths are generally expressed using the webhdfs scheme, such as 'webhdfs://somehost:port/opt/data/filename'. These paths are URIs, so if you need to escape a special character in a path, use URI escaping. For example:

webhdfs://somehost:port/opt/data/my%20file
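For example, an escaped webhdfs path can be used wherever a file path is accepted, as in this sketch of an external table definition. The host, port, column list, and file name are assumptions for illustration only:

=> CREATE EXTERNAL TABLE sales (tx_id INT, amount FLOAT)
   AS COPY FROM 'webhdfs://somehost:50070/opt/data/my%20file.orc' ORC;

Here %20 stands for the space in the file name, as described above.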
Cluster Layout

Vertica and Hadoop each use a cluster of nodes for distributed processing. These clusters can be co-located, meaning you run both products on the same machines, or separate.

- Co-Located Clusters are for use with Vertica for SQL on Hadoop licenses.
- Separate Clusters are for use with Premium Edition and Community Edition licenses.

With either architecture, if you are using the hdfs scheme to read ORC or Parquet files, you must do some additional configuration. See Configuring hdfs:/// Access.

Co-Located Clusters

With co-located clusters, Vertica is installed on some or all of your Hadoop nodes. The Vertica nodes use a private network in addition to the public network used by all Hadoop nodes, as the following figure shows:

You might choose to place Vertica on all of your Hadoop nodes or only on some of them. If you are using HDFS Storage Locations you should use at least three Vertica nodes, the minimum number for K-Safety. Using more Vertica nodes can improve performance because the HDFS data needed by a query is more likely to be local.

Normally, both Hadoop and Vertica use the entire node. Because this configuration uses shared nodes, you must address potential resource contention in your configuration on those nodes. See Configuring Hadoop for Co-Located Clusters for more information. No changes are needed on Hadoop-only nodes.

You can place Hadoop and Vertica clusters within a single rack, or you can span across many racks and nodes. Spreading node types across racks can improve efficiency.
Hardware Recommendations

Hadoop clusters frequently do not have identical provisioning requirements or hardware configurations. However, Vertica nodes should be equivalent in size and capability, per the best-practice standards recommended in General Hardware and OS Requirements and Recommendations in Installing Vertica. Because Hadoop cluster specifications do not always meet these standards, Hewlett Packard Enterprise recommends the following specifications for Vertica nodes in your Hadoop cluster.

Processor: For best performance, run:
- Two-socket servers with 8-14 core CPUs, clocked at or above 2.6 GHz, for clusters over 10 TB
- Single-socket servers with 8-12 cores, clocked at or above 2.6 GHz, for clusters under 10 TB

Memory: Distribute the memory appropriately across all memory channels in the server:
- Minimum: 8 GB of memory per physical CPU core in the server
- High-performance applications: 12-16 GB of memory per physical core
- Type: at least DDR3-1600, preferably DDR3-1866

Storage: Read/write:
- Minimum: 40 MB/s per physical core of the CPU
- For best performance: 60-80 MB/s per physical core

Storage post RAID: Each node should have 1-9 TB. For a production setting, RAID 10 is recommended. In some cases, RAID 50 is acceptable.
Because of the heavy compression and encoding that Vertica does, SSDs are not required. In most cases, a RAID of more, less-expensive HDDs performs just as well as a RAID of fewer SSDs.

If you intend to use RAID 50 for your data partition, you should keep a spare node in every rack, allowing for manual failover of a Vertica node in the case of a drive failure. A Vertica node recovery is faster than a RAID 50 rebuild. Also, be sure to never put more than 10 TB compressed on any node, to keep node recovery times at an acceptable rate.

Network: 10 Gb networking in almost every case. With the introduction of 10 Gb Ethernet over Cat6a, the cost difference is minimal.

Configuring Hadoop for Co-Located Clusters

If you are co-locating Vertica on any HDFS nodes, there are some additional configuration requirements.

webhdfs

Hadoop has two services that can provide web access to HDFS:
- webhdfs
- httpfs

For Vertica, you must use the webhdfs service.

YARN

The YARN service is available in newer releases of Hadoop. It performs resource management for Hadoop clusters. When co-locating Vertica on YARN-managed Hadoop nodes, you must make some changes in YARN.

HPE recommends reserving at least 16 GB of memory for Vertica on shared nodes. Reserving more will improve performance. How you do this depends on your Hadoop distribution:
- If you are using Hortonworks, create a "Vertica" node label and assign it to the nodes that are running Vertica.
- If you are using Cloudera, enable and configure static service pools.

Consult the documentation for your Hadoop distribution for details. Alternatively, you can disable YARN on the shared nodes.

Hadoop Balancer

The Hadoop Balancer can redistribute data blocks across HDFS. For many Hadoop services, this feature is useful. However, for Vertica it can reduce performance under some conditions. If you are using HDFS storage locations, the Hadoop load balancer can move data away from the Vertica nodes that are operating on it, degrading performance. This behavior can also occur when reading ORC or Parquet files if Vertica is not running on all Hadoop nodes. (If you are using separate Vertica and Hadoop clusters, all Hadoop access is over the network, and the performance cost is less noticeable.)

To prevent the undesired movement of data blocks across the HDFS cluster, consider excluding Vertica nodes from rebalancing. See the Hadoop documentation to learn how to do this.

Replication Factor

By default, HDFS stores three copies of each data block. Vertica is generally set up to store two copies of each data item through K-Safety. Thus, lowering the replication factor to 2 can save space and still provide data protection. To lower the number of copies HDFS stores, set HadoopFSReplication, as explained in Troubleshooting HDFS Storage Locations.

Disk Space for Non-HDFS Use

You also need to reserve some disk space for non-HDFS use. To reserve disk space using Ambari, set dfs.datanode.du.reserved to a value in the hdfs-site.xml configuration file. Setting this parameter preserves space for non-HDFS files that Vertica requires; a sample fragment appears below.
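For illustration only, the following hdfs-site.xml fragment reserves 10 GB per data directory for non-HDFS use. The value is an assumption; size it for your own workload:

<!-- hdfs-site.xml: bytes reserved per volume for non-HDFS files -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>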
Separate Clusters

In the Premium Edition product, your Vertica and Hadoop clusters must be set up on separate nodes, ideally connected by a high-bandwidth network connection. This is different from the configuration for Vertica for SQL on Hadoop, in which Vertica nodes are co-located on Hadoop nodes. The following figure illustrates the configuration for separate clusters:

The network is a key performance component of any well-configured cluster. When Vertica stores data to HDFS, it writes and reads data across the network. The layout shown in the figure calls for two networks, and there are benefits to adding a third:

Database Private Network: Vertica uses a private network for command and control and moving data between nodes in support of its database functions. In some
networks, the command and control and passing of data are split across two networks.

Database/Hadoop Shared Network: Each Vertica node must be able to connect to each Hadoop data node and the Name Node. Hadoop best practices generally require a dedicated network for the Hadoop cluster. This is not a technical requirement, but a dedicated network improves Hadoop performance. Vertica and Hadoop should share the dedicated Hadoop network.

Optional Client Network: Outside clients may access the clustered networks through a client network. This is not an absolute requirement, but the use of a third network that supports client connections to either Vertica or Hadoop can improve performance. If the configuration does not support a client network, then client connections should use the shared network.
Choosing Which Hadoop Interface to Use

Vertica provides several ways to interact with data stored in Hadoop. This section explains how to choose among them. Decisions about Cluster Layout can affect the decisions you make about Hadoop interfaces.

Creating an HDFS Storage Location

Using a storage location to store data in the Vertica native file format (ROS) delivers the best query performance among the available Hadoop options. (Storing ROS files on the local disk rather than in Hadoop is faster still.) If you already have data in Hadoop, however, doing this means you are importing that data into Vertica.

For co-located clusters, which do not use local file storage, you might still choose to use an HDFS storage location for better performance. You can use the HDFS Connector to load data that is already in HDFS into Vertica. For separate clusters, which use local file storage, consider using an HDFS storage location for lower-priority data.

See Using HDFS Storage Locations and Using the HDFS Connector.

Reading ORC and Parquet Files

If your data is stored in the Optimized Row Columnar (ORC) or Parquet format, Vertica can query that data directly from HDFS. This option is faster than using the HCatalog Connector, but you cannot pull schema definitions from Hive directly into the database. Vertica reads the data in place; no extra copies are made. See Reading Native Hadoop File Formats.

Using the HCatalog Connector

The HCatalog Connector uses Hadoop services (Hive and HCatalog) to query data stored in HDFS. Like the ORC reader, it reads data in place rather than making copies. Using this interface you can read all file formats supported by Hadoop, including Parquet and ORC, and Vertica can use Hive's schema definitions. However, performance can be poor in some cases. The HCatalog Connector is also sensitive to
changes in the Hadoop libraries on which it depends; upgrading your Hadoop cluster might affect your HCatalog connections. See Using the HCatalog Connector.

Using the HDFS Connector

The HDFS Connector can be used to create and query external tables, reading the data in place rather than making copies. It can be used with any data format for which a parser is available. It does not use Hive data; you have to define the table yourself. Its performance can be poor because, like the HCatalog Connector, it cannot take advantage of the benefits of columnar file formats. See Using the HDFS Connector.

Using the MapReduce Connector

The other interfaces described in this section allow you to read Hadoop data from Vertica or create Vertica data in Hadoop. The MapReduce Connector, in contrast, allows you to integrate with Hadoop's MapReduce jobs. Use this connector to send Vertica data to MapReduce or to have MapReduce jobs create data in Vertica. See Using the MapReduce Connector.
Using Kerberos with Hadoop

If your Hadoop cluster uses Kerberos authentication to restrict access to HDFS, you must configure Vertica to make authenticated connections. The details of this configuration vary, based on which methods you are using to access HDFS data:

- How Vertica uses Kerberos With Hadoop
- Configuring Kerberos

How Vertica uses Kerberos With Hadoop

Vertica authenticates with Hadoop in two ways that require different configurations:

- User Authentication: On behalf of the user, by passing along the user's existing Kerberos credentials, as occurs with the HDFS Connector and the HCatalog Connector.
- Vertica Authentication: On behalf of system processes (such as the Tuple Mover), by using a special Kerberos credential stored in a keytab file.

User Authentication

To use Vertica with Kerberos and Hadoop, the client user first authenticates with the Kerberos server (Key Distribution Center, or KDC) being used by the Hadoop cluster. A user might run kinit or sign in to Active Directory, for example.

A user who authenticates to a Kerberos server receives a Kerberos ticket. At the beginning of a client session, Vertica automatically retrieves this ticket. The database then uses this ticket to get a Hadoop token, which Hadoop uses to grant access. Vertica uses this token to access HDFS, such as when executing a query on behalf of the user. When the token expires, the database automatically renews it, also renewing the Kerberos ticket if necessary.

The user must have been granted permission to access the relevant files in HDFS. This permission is checked the first time Vertica reads HDFS data.

The following figure shows how the user, Vertica, Hadoop, and Kerberos interact in user authentication:
Using Kerberos with Hadoop When using the HDFS Connector or the HCatalog Connector, or when reading an ORC or Parquet file stored in HDFS, Vertica uses the client identity as the preceding figure shows. Vertica Authentication Automatic processes, such as the Tuple Mover, do not log in the way users do. Instead, Vertica uses a special identity (principal) stored in a keytab file on every database node. (This approach is also used for Vertica clusters that use Kerberos but do not use Hadoop.) After you configure the keytab file, Vertica uses the principal residing there to automatically obtain and maintain a Kerberos ticket, much as in the client scenario. In this case, the client does not interact with Kerberos. The following figure shows the interactions required for Vertica authentication: HPE Vertica Analytic Database (7.2.x) Page 19 of 145
Using Kerberos with Hadoop Each Vertica node uses its own principal; it is common to incorporate the name of the node into the principal name. You can either create one keytab per node, containing only that node's principal, or you can create a single keytab containing all the principals and distribute the file to all nodes. Either way, the node uses its principal to get a Kerberos ticket and then uses that ticket to get a Hadoop token. For simplicity, the preceding figure shows the full set of interactions for only one database node. When creating HDFS storage locations Vertica uses the principal in the keytab file, not the principal of the user issuing the CREATE LOCATION statement. See Also For specific configuration instructions, see Configuring Kerberos. HPE Vertica Analytic Database (7.2.x) Page 20 of 145
Configuring Kerberos

Vertica can connect with Hadoop in several ways, and how you manage Kerberos authentication varies by connection type. This documentation assumes that you are using Kerberos for both your HDFS and Vertica clusters.

Prerequisite: Setting Up Users and the Keytab File

If you have not already configured Kerberos authentication for Vertica, follow the instructions in Configure for Kerberos Authentication. In particular:

- Create one Kerberos principal per node.
- Place the keytab file(s) in the same location on each database node and set its location in KerberosKeytabFile (see Specify the Location of the Keytab File).
- Set KerberosServiceName to the name of the principal (see Inform About the Kerberos Principal).

HCatalog Connector

You use the HCatalog Connector to query data in Hive. Queries are executed on behalf of Vertica users. If the current user has a Kerberos key, then Vertica passes it to the HCatalog Connector automatically. Verify that all users who need access to Hive have been granted access to HDFS.

In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that you enable all Hadoop components to impersonate the Vertica user. The easiest way to do this is to set the proxyuser property using wildcards for all users on all hosts and in all groups (a sample core-site.xml fragment appears below, after the HDFS Connector subsection). Consult your Hadoop documentation for instructions. Make sure you do this before running hcatutil (see Configuring Vertica for HCatalog).

HDFS Connector

The HDFS Connector loads data from HDFS into Vertica on behalf of the user, using a User Defined Source. If the user performing the data load has a Kerberos key, then the UDS uses it to access HDFS. Verify that all users who use this connector have been granted access to HDFS.
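As noted for the HCatalog Connector above, impersonation is enabled in core-site.xml through proxyuser properties. The following fragment is a minimal sketch: the service user name (hive) is an assumption, you may need equivalent entries for other Hadoop service users, and your distribution may manage this setting through Ambari or Cloudera Manager instead of direct file edits:

<!-- core-site.xml: let the hive service user impersonate any user on any host -->
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>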
HDFS Storage Location

You can create a database storage location in HDFS. An HDFS storage location provides improved performance compared to other HDFS interfaces (such as the HCatalog Connector).

After you create Kerberos principals for each node, give all of them read and write permissions to the HDFS directory you will use as a storage location. If you plan to back up HDFS storage locations, take the following additional steps:

- Grant Hadoop superuser privileges to the new principals.
- Configure backups, including setting the HadoopConfigDir configuration parameter, following the instructions in Configuring Hadoop and Vertica to Enable Backup of HDFS Storage.
- Configure user impersonation to be able to restore from backups, following the instructions in "Setting Kerberos Parameters" in Configuring Vertica to Restore HDFS Storage Locations.

Because the keytab file supplies the principal used to create the location, you must have it in place before creating the storage location. After you deploy keytab files to all database nodes, use the CREATE LOCATION statement to create the storage location as usual (a sketch appears after the Token Expiration section below).

Token Expiration

Vertica attempts to automatically refresh Hadoop tokens before they expire, but you can also set a minimum refresh frequency if you prefer. The HadoopFSTokenRefreshFrequency configuration parameter specifies the frequency in seconds:

=> ALTER DATABASE exampledb SET HadoopFSTokenRefreshFrequency = '86400';

If the current age of the token is greater than the value specified in this parameter, Vertica refreshes the token before accessing data stored in HDFS.
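For reference, a minimal sketch of creating the storage location after the keytab files are deployed might look like the following. The host, port, path, and label are assumptions; adjust them for your cluster:

=> CREATE LOCATION 'webhdfs://namenode.example.com:50070/user/dbadmin/verticadata'
   ALL NODES USAGE 'data' LABEL 'hdfs_location';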
See Also

- How Vertica uses Kerberos With Hadoop
- Troubleshooting Kerberos Authentication
Reading Native Hadoop File Formats

When you create external tables or copy data into tables, you can access data in certain native Hadoop formats directly. Currently, Vertica supports the ORC (Optimized Row Columnar) and Parquet formats. Because this approach allows you to define your tables yourself instead of fetching the metadata through WebHCat, these readers can provide slightly better performance than the HCatalog Connector. If you are already using the HCatalog Connector for other reasons, however, you might find it more convenient to use it to read data in these formats also. See Using the HCatalog Connector.

You can use the hdfs scheme to access ORC and Parquet files stored in HDFS, as explained later in this section. To use this scheme you must perform some additional configuration; see Configuring hdfs:/// Access.

Requirements

- The ORC or Parquet files must not use complex data types. All simple data types supported in Hive version 0.11 or later are supported.
- Files compressed by Hive or Impala require Zlib (GZIP) or Snappy compression. Vertica does not support LZO compression for these formats.

Creating External Tables

In the CREATE EXTERNAL TABLE AS COPY statement, specify a format of ORC or PARQUET as follows:

=> CREATE EXTERNAL TABLE tablename (columns) AS COPY FROM path ORC;
=> CREATE EXTERNAL TABLE tablename (columns) AS COPY FROM path PARQUET;

If the file resides on the local file system of the node where you issue the command: Use a local file path for path. Escape special characters in file paths with backslashes.
If the file resides elsewhere in HDFS: Use the hdfs:/// prefix (three slashes), and then specify the file path. Escape special characters in HDFS paths using URI encoding, for example %20 for space. Vertica automatically converts from the hdfs scheme to the webhdfs scheme if necessary. You can also directly use a webhdfs:// prefix and specify the host name, port, and file path. Using the hdfs scheme potentially provides better performance when reading files not protected by Kerberos.

When defining an external table, you must define all of the columns in the file. Unlike with some other data sources, you cannot select only the columns of interest. If you omit columns, the ORC or Parquet reader aborts with an error.

Files stored in HDFS are governed by HDFS privileges. For files stored on the local disk, however, Vertica requires that users be granted access. All users who have administrative privileges have access. For other users, you must create a storage location and grant access to it. See CREATE EXTERNAL TABLE AS COPY. HDFS privileges are still enforced, so it is safe to create a location for webhdfs://host:port. Only users who have access to both the Vertica user storage location and the HDFS directory can read from the table.

Loading Data

In the COPY statement, specify a format of ORC or PARQUET:

=> COPY tablename FROM path ORC;
=> COPY tablename FROM path PARQUET;

For files that are not local, specify ON ANY NODE to improve performance.

=> COPY t FROM 'hdfs:///opt/data/orcfile' ON ANY NODE ORC;

As with external tables, path may be a local or hdfs:/// path. Be aware that if you load from multiple files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.
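For example, several ORC files can be loaded in one statement by listing their paths; as noted above, a failure in any one of them rolls back the entire load. The paths are assumptions for illustration:

=> COPY t FROM 'hdfs:///opt/data/data1.orc', 'hdfs:///opt/data/data2.orc' ON ANY NODE ORC;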
Supported Data Types

The Vertica ORC and Parquet file readers can natively read columns of all data types supported in Hive version 0.11 and later except for complex types. If complex types such as maps are encountered, the COPY or CREATE EXTERNAL TABLE AS COPY statement aborts with an error message. The readers do not attempt to read only some columns; either the entire file is read or the operation fails. For a complete list of supported types, see HIVE Data Types.

Kerberos Authentication

If the file to be read is located on an HDFS cluster that uses Kerberos authentication, Vertica uses the current user's principal to authenticate. It does not use the database's principal.

Examples

The following example shows how you can read from all ORC files in a local directory. This example uses all supported data types.

=> CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT,
   a6 DOUBLE PRECISION, a7 BOOLEAN, a8 DATE, a9 TIMESTAMP, a10 VARCHAR(20),
   a11 VARCHAR(20), a12 CHAR(20), a13 BINARY(20), a14 DECIMAL(10,5))
   AS COPY FROM '/data/orc_test_*.orc' ORC;

The following example shows the error that is produced if the file you specify is not recognized as an ORC file:

=> CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT)
   AS COPY FROM '/data/not_an_orc_file.orc' ORC;
ERROR 0: Failed to read orc source [/data/not_an_orc_file.orc]: Not an ORC file

See Also

- Query Performance
- Troubleshooting Reads from Native File Formats
Query Performance

When working with external tables in native formats, Vertica tries to improve performance in two ways:

- Pushing query execution closer to the data so less has to be read and transmitted
- Using data locality in planning the query

Considerations When Writing Files

The decisions you make when writing ORC and Parquet files can affect performance when using them. To get the best performance from Vertica, follow these guidelines when writing your files:

- Use the latest available Hive version. (You can still read your files with earlier versions.)
- Use a large stripe size. 256 MB or greater is preferred.
- Partition the data at the table level.
- Sort the columns based on frequency of access, with most-frequently accessed columns appearing first.
- Use Snappy or Zlib/GZIP compression.

Predicate Pushdown

Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network. ORC files have three levels of indexing: file statistics, stripe statistics, and row group indexes. Predicates are applied only to the first two levels. Parquet files can have statistics in the ColumnMetaData and DataPageHeader. Predicates are applied only to the ColumnMetaData.

Predicate pushdown is automatically applied for files written with Hive version 0.14 and later. Files written with earlier versions of Hive might not contain the required statistics. When executing a query against a file that lacks these statistics, Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS system table. If you are seeing performance problems with your queries, check this table for these events.
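As a quick check, sketched here, you can look for that event type directly in the system table:

=> SELECT * FROM QUERY_EVENTS
   WHERE event_type = 'EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED';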
Data Locality

In a cluster where Vertica nodes are co-located on HDFS nodes, the query can use data locality to improve performance. For Vertica to do so, both of the following conditions must exist:

- The data is on an HDFS node where a database node is also present.
- The query is not restricted to specific nodes using ON NODE.

When both these conditions exist, the query planner uses the co-located database node to read that data locally, instead of making a network call.

You can see how much data is being read locally by inspecting the query plan. The label for LoadStep(s) in the plan contains a statement of the form: "X% of ORC data matched with co-located Vertica nodes". To increase the volume of local reads, consider adding more database nodes. HDFS data, by its nature, can't be moved to specific nodes, but if you run more database nodes you increase the likelihood that a database node is local to one of the copies of the data.

Configuring hdfs:/// Access

When reading ORC or Parquet files from HDFS, you can use the hdfs scheme instead of the webhdfs scheme. Using the hdfs scheme can improve performance by bypassing the webhdfs service. To support the hdfs scheme, your Vertica nodes need access to certain Hadoop configuration files:

- If Vertica is co-located on HDFS nodes, then those files are already present. Verify that the HadoopConfDir environment variable is correctly set. Its path should include a directory containing the core-site.xml and hdfs-site.xml files.
- If Vertica is running on a separate cluster, you must copy the required files to those nodes and set the HadoopConfDir environment variable. A simple way to do so is to configure your Vertica nodes as Hadoop edge nodes. Edge nodes are used to run client applications; from Hadoop's perspective, Vertica is a client application. You can use Ambari or Cloudera Manager to configure edge nodes. For more information, see the documentation for your Hadoop vendor.
Reading Native Hadoop File Formats Using the hdfs scheme does not remove the need for access to the webhdfs service. The hdfs scheme is not available for all files. If hdfs is not available, then Vertica automatically uses webhdfs instead. If you update the configuration files after starting Vertica, use the following statement to refresh them: => SELECT CLEAR_CACHES(); Troubleshooting Reads from Native File Formats You might encounter the following issues when reading ORC or Parquet files. webhdfs Error When Using hdfs URIs When creating an external table or loading data and using the hdfs scheme, you might see errors from webhdfs failures. Such errors indicate that Vertica was not able to use the hdfs scheme and fell back to webhdfs, but that the webhdfs configuration is incorrect. Verify that the HDFS configuration files in HadoopConfDir have the correct webhdfs configuration for your Hadoop cluster. See Configuring hdfs:/// Access for information about use of these files. See your Hadoop documentation for information about webhdfs configuration. Reads from Parquet Files Report Unexpected Data-Type Mismatches If a Parquet file contains a column of type STRING but the column in Vertica is of a different type, such as INT, you might see an unclear error message. In this case Vertica reports the column in the Parquet file as BYTE_ARRAY, as shown in the following example: ERROR 0: Datatype mismatch: column 2 in the parquet_cpp source [/tmp/nation.0.parquet] has type BYTE_ARRAY, expected int This behavior is specific to Parquet files; with an ORC file the type is correctly reported as STRING. The problem occurs because Parquet does not natively support the STRING type and uses BYTE_ARRAY for strings instead. Because the Parquet file reports its type as BYTE_ARRAY, Vertica has no way to determine if the type is actually a BYTE_ARRAY or a STRING. HPE Vertica Analytic Database (7.2.x) Page 29 of 145
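Building on the mismatch example above, if the BYTE_ARRAY column actually holds strings, defining the corresponding Vertica column as VARCHAR avoids the error. The column layout below is an assumption for illustration:

=> CREATE EXTERNAL TABLE nation (nationkey INT, name VARCHAR(100),
   regionkey INT, comment VARCHAR(500))
   AS COPY FROM '/tmp/nation.0.parquet' PARQUET;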
Time Zones in Timestamp Values Are Not Correct

Reading time stamps from an ORC or Parquet file in Vertica might result in different values, based on the local time zone. This issue occurs because the ORC and Parquet formats do not support the SQL TIMESTAMP data type. If you define the column in your table with the TIMESTAMP data type, Vertica interprets time stamps read from ORC or Parquet files as values in the local time zone. This same behavior occurs in Hive. When this situation occurs, Vertica produces a warning at query time, such as the following:

WARNING 0: SQL TIMESTAMPTZ is more appropriate for ORC TIMESTAMP because values are stored in UTC

When creating the table in Vertica, you can avoid this issue by using the TIMESTAMPTZ data type instead of TIMESTAMP.

Some Date and Timestamp Values Are Wrong by Several Days

When Hive writes ORC or Parquet files, it converts dates before 1583 from the Gregorian calendar to the Julian calendar. Vertica does not perform this conversion. If your file contains dates before this time, values in Hive and the corresponding values in Vertica can differ by up to ten days. This difference applies to both DATE and TIMESTAMP values.

Error 7087: Wrong Number of Columns

When loading data, you might see an error stating that you have the wrong number of columns:

=> CREATE TABLE nation (nationkey bigint, name varchar(500), regionkey bigint, comment varchar(500));
CREATE TABLE
=> COPY nation from :orc_dir ORC;
ERROR 7087: Attempt to load 4 columns from an orc source [/tmp/orc_glob/test.orc] that has 9 columns

When you load data from Hadoop native file formats, your table must consume all of the data in the file, or this error results. To avoid this problem, add the missing columns to your table definition.
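If you are not sure how many columns the file actually contains, one option, assuming the Hive client is available on a Hadoop node, is to dump the file's metadata and read the column list from its schema (the path is the one from the example above):

$ hive --orcfiledump /tmp/orc_glob/test.orc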
Using the HCatalog Connector

The Vertica HCatalog Connector lets you access data stored in Apache's Hive data warehouse software the same way you access it within a native Vertica table. If your files are in the Optimized Row Columnar (ORC) or Parquet format and do not use complex types, the HCatalog Connector creates an external table and uses the ORC or Parquet reader instead of using the Java SerDe. See Reading Native Hadoop File Formats for more information about these readers.

The HCatalog Connector performs predicate pushdown to improve query performance. Instead of reading all data across the network to evaluate a query, the HCatalog Connector moves the evaluation of predicates closer to the data. Predicate pushdown applies to Hive partition pruning, ORC stripe pruning, and Parquet row-group pruning. The HCatalog Connector supports predicate pushdown for the following predicates: >, >=, =, <>, <=, <.

Hive, HCatalog, and WebHCat Overview

There are several Hadoop components that you need to understand in order to use the HCatalog Connector:

- Apache's Hive lets you query data stored in a Hadoop Distributed File System (HDFS) the same way you query data stored in a relational database. Behind the scenes, Hive uses a set of serializer and deserializer (SerDe) classes to extract data from files stored on the HDFS and break it into columns and rows. Each SerDe handles data files in a specific format. For example, one SerDe extracts data from comma-separated data files while another interprets data stored in JSON format.
- Apache HCatalog is a component of the Hadoop ecosystem that makes Hive's metadata available to other Hadoop components (such as Pig).
- WebHCat (formerly known as Templeton) makes HCatalog and Hive data available via a REST web API. Through it, you can make an HTTP request to retrieve data stored in Hive, as well as information about the Hive schema.

Vertica's HCatalog Connector lets you transparently access data that is available through WebHCat. You use the connector to define a schema in Vertica that corresponds to a Hive database or schema. When you query data within this schema,
the HCatalog Connector transparently extracts and formats the data from Hadoop into tabular data. The data within this HCatalog schema appears as if it is native to Vertica. You can even perform operations such as joins between Vertica-native tables and HCatalog tables. For more details, see How the HCatalog Connector Works.

HCatalog Connection Features

The HCatalog Connector lets you query data stored in Hive using the Vertica native SQL syntax. Some of its main features are:

- The HCatalog Connector always reflects the current state of data stored in Hive.
- The HCatalog Connector uses the parallel nature of both Vertica and Hadoop to process Hive data. The result is that querying data through the HCatalog Connector is often faster than querying the data directly through Hive.
- Since Vertica performs the extraction and parsing of data, the HCatalog Connector does not significantly increase the load on your Hadoop cluster.
- The data you query through the HCatalog Connector can be used as if it were native Vertica data. For example, you can execute a query that joins data from a table in an HCatalog schema with a native table.

HCatalog Connection Considerations

There are a few things to keep in mind when using the HCatalog Connector:

- Hive's data is stored in flat files in a distributed filesystem, requiring it to be read and deserialized each time it is queried. This deserialization causes Hive's performance to be much slower than that of Vertica. The HCatalog Connector has to perform the same process as Hive to read the data. Therefore, querying data stored in Hive using the HCatalog Connector is much slower than querying a native Vertica table. If you need to perform extensive analysis on data stored in Hive, you should consider loading it into Vertica through the HCatalog Connector or the WebHDFS connector. Vertica optimization often makes querying data through the HCatalog Connector faster than directly querying it through Hive.
Using the HCatalog Connector Hive supports complex data types such as lists, maps, and structs that Vertica does not support. Columns containing these data types are converted to a JSON representation of the data type and stored as a VARCHAR. See Data Type Conversions from Hive to Vertica. Note: The HCatalog Connector is read only. It cannot insert data into Hive. How the HCatalog Connector Works When planning a query that accesses data from a Hive table, the Vertica HCatalog Connector on the initiator node contacts the WebHCat server in your Hadoop cluster to determine if the table exists. If it does, the connector retrieves the table's metadata from the metastore database so the query planning can continue. When the query executes, all nodes in the Vertica cluster directly retrieve the data necessary for completing the query from HDFS. They then use the Hive SerDe classes to extract the data so the query can execute. This approach takes advantage of the parallel nature of both Vertica and Hadoop. In addition, by performing the retrieval and extraction of data directly, the HCatalog Connector reduces the impact of the query on the Hadoop cluster. HPE Vertica Analytic Database (7.2.x) Page 33 of 145
HCatalog Connector Requirements

Before you can use the HCatalog Connector, both your Vertica and Hadoop installations must meet the following requirements.

Vertica Requirements

All of the nodes in your cluster must have a Java Virtual Machine (JVM) installed. See Installing the Java Runtime on Your Vertica Cluster.

You must also add certain libraries distributed with Hadoop and Hive to your Vertica installation directory. See Configuring Vertica for HCatalog.

Hadoop Requirements

Your Hadoop cluster must meet several requirements to operate correctly with the Vertica Connector for HCatalog:

- It must have Hive and HCatalog installed and running. See Apache's HCatalog page for more information.
- It must have WebHCat (formerly known as Templeton) installed and running. See Apache's WebHCat page for details.
- The WebHCat server and all of the HDFS nodes that store HCatalog data must be directly accessible from all of the hosts in your Vertica database. Verify that any firewall separating the Hadoop cluster and the Vertica cluster will pass WebHCat, metastore database, and HDFS traffic.
- The data that you want to query must be in an internal or external Hive table.
- If a table you want to query uses a non-standard SerDe, you must install the SerDe's classes on your Vertica cluster before you can query the data. See Using Non-Standard SerDes.

Testing Connectivity

To test the connection between your database cluster and WebHCat, log into a node in your Vertica cluster. Then, run the following command to execute an HCatalog query:

$ curl http://webhcatserver:port/templeton/v1/status?user.name=hcatusername
Using the HCatalog Connector Where: webhcatserver is the IP address or hostname of the WebHCat server port is the port number assigned to the WebHCat service (usually 50111) hcatusername is a valid username authorized to use HCatalog Usually, you want to append ;echo to the command to add a linefeed after the curl command's output. Otherwise, the command prompt is automatically appended to the command's output, making it harder to read. For example: $ curl http://hcathost:50111/templeton/v1/status?user.name=hive; echo If there are no errors, this command returns a status message in JSON format, similar to the following: {"status":"ok","version":"v1"} This result indicates that WebHCat is running and that the Vertica host can connect to it and retrieve a result. If you do not receive this result, troubleshoot your Hadoop installation and the connectivity between your Hadoop and Vertica clusters. For details, see Troubleshooting HCatalog Connector Problems. You can also run some queries to verify that WebHCat is correctly configured to work with Hive. The following example demonstrates listing the databases defined in Hive and the tables defined within a database: $ curl http://hcathost:50111/templeton/v1/ddl/database?user.name=hive; echo {"databases":["default","production"]} $ curl http://hcathost:50111/templeton/v1/ddl/database/default/table?user.name=hive; echo {"tables":["messages","weblogs","tweets","transactions"],"database":"default"} See Apache's WebHCat reference for details about querying Hive using WebHCat. Installing the Java Runtime on Your Vertica Cluster The HCatalog Connector requires a 64-bit Java Virtual Machine (JVM). The JVM must support Java 6 or later, and must be the same version as the one installed on your Hadoop nodes. HPE Vertica Analytic Database (7.2.x) Page 35 of 145
Using the HCatalog Connector Note: If your Vertica cluster is configured to execute User Defined Extensions (UDxs) written in Java, it already has a correctly-configured JVM installed. See Developing User Defined Functions in Java in Extending Vertica for more information. Installing Java on your Vertica cluster is a two-step process: 1. Install a Java runtime on all of the hosts in your cluster. 2. Set the JavaBinaryForUDx configuration parameter to tell Vertica the location of the Java executable. Installing a Java Runtime For Java-based features, Vertica requires a 64-bit Java 6 (Java version 1.6) or later Java runtime. Vertica supports runtimes from either Oracle or OpenJDK. You can choose to install either the Java Runtime Environment (JRE) or Java Development Kit (JDK), since the JDK also includes the JRE. Many Linux distributions include a package for the OpenJDK runtime. See your Linux distribution's documentation for information about installing and configuring OpenJDK. To install the Oracle Java runtime, see the Java Standard Edition (SE) Download Page. You usually run the installation package as root in order to install it. See the download page for instructions. Once you have installed a JVM on each host, ensure that the java command is in the search path and calls the correct JVM by running the command: $ java -version This command should print something similar to: java version "1.6.0_37" Java(TM) SE Runtime Environment (build 1.6.0_37-b06) Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode) Note: Any previously installed Java VM on your hosts may interfere with a newly installed Java runtime. See your Linux distribution's documentation for instructions on configuring which JVM is the default. Unless absolutely required, you should uninstall any incompatible version of Java before installing the Java 6 or Java 7 runtime. HPE Vertica Analytic Database (7.2.x) Page 36 of 145
Using the HCatalog Connector Setting the JavaBinaryForUDx Configuration Parameter The JavaBinaryForUDx configuration parameter tells Vertica where to look for the JRE to execute Java UDxs. After you have installed the JRE on all of the nodes in your cluster, set this parameter to the absolute path of the Java executable. You can use the symbolic link that some Java installers create (for example /usr/bin/java). If the Java executable is in your shell search path, you can get the path of the Java executable by running the following command from the Linux command line shell: $ which java /usr/bin/java If the java command is not in the shell search path, use the path to the Java executable in the directory where you installed the JRE. Suppose you installed the JRE in /usr/java/default (which is where the installation package supplied by Oracle installs the Java 1.6 JRE). In this case the Java executable is /usr/java/default/bin/java. You set the configuration parameter by executing the following statement as a database superuser: => ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java'; See ALTER DATABASE for more information on setting configuration parameters. To view the current setting of the configuration parameter, query the CONFIGURATION_PARAMETERS system table: => \x Expanded display is on. => SELECT * FROM CONFIGURATION_PARAMETERS WHERE parameter_name = 'JavaBinaryForUDx'; -[ RECORD 1 ]-----------------+---------------------------------------------------------- node_name ALL parameter_name JavaBinaryForUDx current_value /usr/bin/java default_value change_under_support_guidance f change_requires_restart f description Path to the java binary for executing UDx written in Java Once you have set the configuration parameter, Vertica can find the Java executable on each node in your cluster. Note: Since the location of the Java executable is set by a single configuration HPE Vertica Analytic Database (7.2.x) Page 37 of 145
Using the HCatalog Connector parameter for the entire cluster, you must ensure that the Java executable is installed in the same path on all of the hosts in the cluster. Configuring Vertica for HCatalog Before you can use the HCatalog Connector, you must add certain Hadoop and Hive libraries to your Vertica installation. You must also copy the Hadoop configuration files that specify various connection properties. Vertica uses the values in those configuration files to make its own connections to Hadoop. You need only make these changes on one node in your cluster. After you do this you can install the HCatalog connector. Copy Hadoop Libraries and Configuration Files Vertica provides a tool, hcatutil, to collect the required files from Hadoop. This tool copies selected libraries and XML configuration files from your Hadoop cluster to your Vertica cluster. This tool might also need access to additional libraries: If you plan to use Hive to query files that use Snappy compression, you need access to the Snappy native libraries, libhadoop*.so and libsnappy*.so. If you plan to use Hive to query files that use LZO compression, you need access to the hadoop-lzo-*.jar and libgplcompression.so* libraries. In core-site.xml you must also edit the io.compression.codecs property to include com.hadoop.compression.lzo.lzopcodec. If you plan to use a JSON SerDe with a Hive table, you need access to its library. This is the same library that you used to configure Hive; for example: hive> add jar /home/release/json-serde-1.3-jar-with-dependencies.jar; hive> create external table nationjson (id int,name string,rank int,text string) ROW FORMAT SERDE 'org.openx.data.jsonserde.jsonserde' LOCATION '/user/release/vt/nationjson'; If you are using any other libraries that are not standard across all supported Hadoop versions, you need access to those libraries. HPE Vertica Analytic Database (7.2.x) Page 38 of 145
Using the HCatalog Connector If any of these cases applies to you, do one of the following: Include the path(s) in the path you specify as the value of --hcatlibpath, or Copy the file(s) to a directory already on that path. If Vertica is not co-located on a Hadoop node, you should do the following: 1. Copy /opt/vertica/packages/hcat/tools/hcatutil to a Hadoop node and run it there, specifying a temporary output directory. Your Hadoop, HIVE, and HCatalog lib paths might be different; in particular, in newer versions of Hadoop the HCatalog directory is usually a subdirectory under the HIVE directory. Use the values from your environment in the following command: hcatutil --copyjars --hadoophivehome="/hadoop/lib;/hive/lib;/hcatalog/dist/share" --hadoophiveconfpath="/hadoop;/hive;/webhcat" --hcatlibpath=/tmp/hadoop-files 2. Verify that all necessary files were copied: hcatutil --verifyjars --hcatlibpath=/tmp/hadoop-files 3. Copy that output directory (/tmp/hadoop-files, in this example) to /opt/vertica/packages/hcat/lib on the Vertica node you will connect to when installing the HCatalog connector. If you are updating a Vertica cluster to use a new Hadoop cluster (or a new version of Hadoop), first remove all JAR files in /opt/vertica/packages/hcat/lib except vertica-hcatalogudl.jar. 4. Verify that all necessary files were copied: hcatutil --verifyjars --hcatlibpath=/opt/vertica/packages/hcat If Vertica is co-located on some or all Hadoop nodes, you can do this in one step on a shared node. Your Hadoop, HIVE, and HCatalog lib paths might be different; use the values from your environment in the following command: hcatutil --copyjars --hadoophivehome="/hadoop/lib;/hive/lib;/hcatalog/dist/share" --hadoophiveconfpath="/hadoop;/hive;/webhcat" --hcatlibpath=/opt/vertica/packages/hcat/lib The hcatutil script has the following arguments: HPE Vertica Analytic Database (7.2.x) Page 39 of 145
-c, --copyjars
Copy the required JARs from hadoophivepath to hcatlibpath.

-v, --verifyjars
Verify that the required JARs are present in hcatlibpath.

--hadoophivehome="value1;value2;..."
Paths to the Hadoop, Hive, and HCatalog home directories. You must include the HADOOP_HOME and HIVE_HOME paths. Separate multiple paths by a semicolon (;). Enclose paths in double quotes. In newer versions of Hadoop, look for the HCatalog directory under the Hive directory (for example, /hive/hcatalog/share).

--hcatlibpath="value1;value2;..."
Output path of the lib/ folder for the HCatalog dependency JARs. Usually this is /opt/vertica/packages/hcat. You may use any folder, but make sure to copy all JARs to the hcat/lib folder before installing the HCatalog connector. If you have previously run hcatutil with a different version of Hadoop, remove the old JAR files first (all except vertica-hcatalogudl.jar).

--hadoophiveconfpath="value"
Paths of the Hadoop, Hive, and other components' configuration files (such as core-site.xml, hive-site.xml, and webhcat-site.xml). Separate multiple paths by a semicolon (;). Enclose paths in double quotes. These files contain values that would otherwise have to be specified to CREATE HCATALOG SCHEMA. If you are using Cloudera, or if your HDFS cluster uses Kerberos authentication, this parameter is required. Otherwise, this parameter is optional.

Once you have copied the files and verified them, install the HCatalog connector.

Install the HCatalog Connector
On the same node where you copied the files from hcatutil, install the HCatalog connector by running the install.sql script. This script resides in the ddl/ folder under
your HCatalog connector installation path. This script creates the library and the VHCatSource and VHCatParser functions.

Note: The data that was copied using hcatutil is now stored in the database. If you change any of those values in Hadoop, you need to rerun hcatutil and install.sql.

The following statement returns the names of the libraries and configuration files currently being used:

=> SELECT dependencies FROM user_libraries WHERE lib_name='VHCatalogLib';

Now you can create an HCatalog schema, with parameters that point to your existing Hadoop/Hive/WebHCat services, as described in Defining a Schema Using the HCatalog Connector.

Upgrading to a New Version of Vertica
After upgrading to a new version of Vertica, perform the following steps:
1. Uninstall the HCatalog Connector using the uninstall.sql script. This script resides in the ddl/ folder under your HCatalog connector installation path.
2. Delete the contents of the hcatlibpath directory.
3. Rerun hcatutil.
4. Reinstall the HCatalog Connector using the install.sql script.
For more information about upgrading Vertica, see Upgrading Vertica to a New Version.

Additional Options for Native File Formats
When reading Hadoop native file formats (ORC or Parquet), the HCatalog Connector attempts to use the built-in readers. When doing so, it uses the webhdfs scheme by default. You do not need to make any additional changes to support this.
You can instruct the HCatalog Connector to use the hdfs scheme instead by using ALTER DATABASE to set HCatalogConnectorLibHDFSPP to true. If you change this setting, you must also perform the configuration described in Configuring hdfs:/// Access.
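For example, the following is a minimal sketch of enabling the hdfs scheme. The database name is illustrative, and the literal your version accepts for the Boolean value (1 versus true) may differ:

=> ALTER DATABASE mydb SET HCatalogConnectorLibHDFSPP = 1; -- 'mydb' and the value '1' are examples

To return to the default webhdfs scheme, set the parameter back to its false (0) value.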
Using the HCatalog Connector Using the HCatalog Connector with HA NameNode Newer distributions of Hadoop support the High Availability NameNode (HA NN) for HDFS access. Some additional configuration is required to use this feature with the HCatalog Connector. If you do not perform this configuration, attempts to retrieve data through the connector will produce an error. To use HA NN with Vertica, first copy /etc/hadoop/conf from the HDFS cluster to every node in your Vertica cluster. You can put this directory anywhere, but it must be in the same location on every node. (In the example below it is in /opt/hcat/hadoop_conf.) Then uninstall the HCat library, configure the UDx to use that configuration directory, and reinstall the library: => \i /opt/vertica/packages/hcat/ddl/uninstall.sql DROP LIBRARY => ALTER DATABASE mydb SET JavaClassPathSuffixForUDx = '/opt/hcat/hadoop_conf'; WARNING 2693: Configuration parameter JavaClassPathSuffixForUDx has been deprecated; setting it has no effect => \i /opt/vertica/packages/hcat/ddl/install.sql CREATE LIBRARY CREATE SOURCE FUNCTION GRANT PRIVILEGE CREATE PARSER FUNCTION GRANT PRIVILEGE Despite the warning message, this step is necessary. After taking these steps, HCatalog queries will now work. Defining a Schema Using the HCatalog Connector After you set up the HCatalog Connector, you can use it to define a schema in your Vertica database to access the tables in a Hive database. You define the schema using the CREATE HCATALOG SCHEMA statement. When creating the schema, you must supply at least two pieces of information: HPE Vertica Analytic Database (7.2.x) Page 42 of 145
Using the HCatalog Connector the name of the schema to define in Vertica the host name or IP address of Hive's metastore database (the database server that contains metadata about Hive's data, such as the schema and table definitions) Other parameters are optional. If you do not supply a value, Vertica uses default values. After you define the schema, you can query the data in the Hive data warehouse in the same way you query a native Vertica table. The following example demonstrates creating an HCatalog schema and then querying several system tables to examine the contents of the new schema. See Viewing Hive Schema and Table Metadata for more information about these tables. => CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' HCATALOG_SCHEMA='default' -> HCATALOG_USER='hcatuser'; CREATE SCHEMA => -- Show list of all HCatalog schemas => \x Expanded display is on. => SELECT * FROM v_catalog.hcatalog_schemata; -[ RECORD 1 ]--------+------------------------------ schema_id 45035996273748980 schema_name hcat schema_owner_id 45035996273704962 schema_owner dbadmin create_time 2013-11-04 15:09:03.504094-05 hostname hcathost port 9933 webservice_hostname hcathost webservice_port 50111 hcatalog_schema_name default hcatalog_user_name hcatuser metastore_db_name hivemetastoredb => -- List the tables in all HCatalog schemas => SELECT * FROM v_catalog.hcatalog_table_list; -[ RECORD 1 ]------+------------------ table_schema_id 45035996273748980 table_schema hcat hcatalog_schema default table_name messages hcatalog_user_name hcatuser -[ RECORD 2 ]------+------------------ table_schema_id 45035996273748980 table_schema hcat hcatalog_schema default table_name weblog hcatalog_user_name hcatuser -[ RECORD 3 ]------+------------------ table_schema_id 45035996273748980 table_schema hcat hcatalog_schema default table_name tweets HPE Vertica Analytic Database (7.2.x) Page 43 of 145
Using the HCatalog Connector hcatalog_user_name hcatuser Querying Hive Tables Using HCatalog Connector Once you have defined the HCatalog schema, you can query data from the Hive database by using the schema name in your query. => SELECT * from hcat.messages limit 10; messageid userid time message -----------+------------+---------------------+---------------------------------- 1 npfq1ayhi 2013-10-29 00:10:43 hymenaeos cursus lorem Suspendis 2 N7svORIoZ 2013-10-29 00:21:27 Fusce ad sem vehicula morbi 3 4VvzN3d 2013-10-29 00:32:11 porta Vivamus condimentum 4 heojkmtmc 2013-10-29 00:42:55 lectus quis imperdiet 5 corows3of 2013-10-29 00:53:39 sit eleifend tempus a aliquam mauri 6 odrp1i 2013-10-29 01:04:23 risus facilisis sollicitudin sceler 7 AU7a9Kp 2013-10-29 01:15:07 turpis vehicula tortor 8 ZJWg185DkZ 2013-10-29 01:25:51 sapien adipiscing eget Aliquam tor 9 E7ipAsYC3 2013-10-29 01:36:35 varius Cum iaculis metus 10 kstcv 2013-10-29 01:47:19 aliquam libero nascetur Cum mal (10 rows) Since the tables you access through the HCatalog Connector act like Vertica tables, you can perform operations that use both Hive data and native Vertica data, such as a join: => SELECT u.firstname, u.lastname, d.time, d.message from UserData u -> JOIN hcat.messages d ON u.userid = d.userid LIMIT 10; FirstName LastName time Message ----------+----------+---------------------+----------------------------------- Whitney Kerr 2013-10-29 00:10:43 hymenaeos cursus lorem Suspendis Troy Oneal 2013-10-29 00:32:11 porta Vivamus condimentum Renee Coleman 2013-10-29 00:42:55 lectus quis imperdiet Fay Moss 2013-10-29 00:53:39 sit eleifend tempus a aliquam mauri Dominique Cabrera 2013-10-29 01:15:07 turpis vehicula tortor Mohammad Eaton 2013-10-29 00:21:27 Fusce ad sem vehicula morbi Cade Barr 2013-10-29 01:25:51 sapien adipiscing eget Aliquam tor Oprah Mcmillan 2013-10-29 01:36:35 varius Cum iaculis metus Astra Sherman 2013-10-29 01:58:03 dignissim odio Pellentesque primis Chelsea Malone 2013-10-29 02:08:47 pede tempor dignissim Sed luctus (10 rows) Viewing Hive Schema and Table Metadata When using Hive, you access metadata about schemas and tables by executing statements written in HiveQL (Hive's version of SQL) such as SHOW TABLES. When using the HCatalog Connector, you can get metadata about the tables in the Hive database through several Vertica system tables. HPE Vertica Analytic Database (7.2.x) Page 44 of 145
Using the HCatalog Connector There are four system tables that contain metadata about the tables accessible through the HCatalog Connector: HCATALOG_SCHEMATA lists all of the schemas that have been defined using the HCatalog Connector. See HCATALOG_SCHEMATA in the SQL Reference Manual for detailed information. HCATALOG_TABLE_LIST contains an overview of all of the tables available from all schemas defined using the HCatalog Connector. This table only shows the tables which the user querying the table can access. The information in this table is retrieved using a single call to WebHCat for each schema defined using the HCatalog Connector, which means there is a little overhead when querying this table. See HCATALOG_TABLE_LIST in the SQL Reference Manual for detailed information. HCATALOG_TABLES contains more in-depth information than HCATALOG_ TABLE_LIST. However, querying this table results in Vertica making a REST web service call to WebHCat for each table available through the HCatalog Connector. If there are many tables in the HCatalog schemas, this query could take a while to complete. See HCATALOG_TABLES in the SQL Reference Manual for more information. HCATALOG_COLUMNS lists metadata about all of the columns in all of the tables available through the HCatalog Connector. Similarly to HCATALOG_TABLES, querying this table results in one call to WebHCat per table, and therefore can take a while to complete. See HCATALOG_COLUMNS in the SQL Reference Manual for more information. The following example demonstrates querying the system tables containing metadata for the tables available through the HCatalog Connector. => CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' -> HCATALOG_SCHEMA='default' HCATALOG_DB='default' HCATALOG_USER='hcatuser'; CREATE SCHEMA => SELECT * FROM HCATALOG_SCHEMATA; -[ RECORD 1 ]--------+----------------------------- schema_id 45035996273864536 schema_name hcat schema_owner_id 45035996273704962 schema_owner dbadmin create_time 2013-11-05 10:19:54.70965-05 hostname hcathost HPE Vertica Analytic Database (7.2.x) Page 45 of 145
Using the HCatalog Connector port 9083 webservice_hostname hcathost webservice_port 50111 hcatalog_schema_name default hcatalog_user_name hcatuser metastore_db_name hivemetastoredb => SELECT * FROM HCATALOG_TABLE_LIST; -[ RECORD 1 ]------+------------------ table_schema_id 45035996273864536 table_schema hcat hcatalog_schema default table_name hcatalogtypes hcatalog_user_name hcatuser -[ RECORD 2 ]------+------------------ table_schema_id 45035996273864536 table_schema hcat hcatalog_schema default table_name tweets hcatalog_user_name hcatuser -[ RECORD 3 ]------+------------------ table_schema_id 45035996273864536 table_schema hcat hcatalog_schema default table_name messages hcatalog_user_name hcatuser -[ RECORD 4 ]------+------------------ table_schema_id 45035996273864536 table_schema hcat hcatalog_schema default table_name msgjson hcatalog_user_name hcatuser => -- Get detailed description of a specific table => SELECT * FROM HCATALOG_TABLES WHERE table_name = 'msgjson'; -[ RECORD 1 ]---------+----------------------------------------------------------- table_schema_id 45035996273864536 table_schema hcat hcatalog_schema default table_name msgjson hcatalog_user_name hcatuser min_file_size_bytes 13524 total_number_files 10 location hdfs://hive.example.com:8020/user/exampleuser/msgjson last_update_time 2013-11-05 14:18:07.625-05 output_format org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat last_access_time 2013-11-11 13:21:33.741-05 max_file_size_bytes 45762 is_partitioned f partition_expression table_owner hcatuser input_format org.apache.hadoop.mapred.textinputformat total_file_size_bytes 453534 hcatalog_group supergroup permission rwxr-xr-x => -- Get list of columns in a specific table => SELECT * FROM HCATALOG_COLUMNS WHERE table_name = 'hcatalogtypes' -> ORDER BY ordinal_position; -[ RECORD 1 ]------------+----------------- HPE Vertica Analytic Database (7.2.x) Page 46 of 145
Using the HCatalog Connector table_schema hcat hcatalog_schema default table_name hcatalogtypes is_partition_column f column_name intcol hcatalog_data_type int data_type int data_type_id 6 data_type_length 8 character_maximum_length numeric_precision numeric_scale datetime_precision interval_precision ordinal_position 1 -[ RECORD 2 ]------------+----------------- table_schema hcat hcatalog_schema default table_name hcatalogtypes is_partition_column f column_name floatcol hcatalog_data_type float data_type float data_type_id 7 data_type_length 8 character_maximum_length numeric_precision numeric_scale datetime_precision interval_precision ordinal_position 2 -[ RECORD 3 ]------------+----------------- table_schema hcat hcatalog_schema default table_name hcatalogtypes is_partition_column f column_name doublecol hcatalog_data_type double data_type float data_type_id 7 data_type_length 8 character_maximum_length numeric_precision numeric_scale datetime_precision interval_precision ordinal_position 3 -[ RECORD 4 ]------------+----------------- table_schema hcat hcatalog_schema default table_name hcatalogtypes is_partition_column f column_name charcol hcatalog_data_type string data_type varchar(65000) data_type_id 9 data_type_length 65000 character_maximum_length 65000 numeric_precision numeric_scale HPE Vertica Analytic Database (7.2.x) Page 47 of 145
Using the HCatalog Connector datetime_precision interval_precision ordinal_position 4 -[ RECORD 5 ]------------+----------------- table_schema hcat hcatalog_schema default table_name hcatalogtypes is_partition_column f column_name varcharcol hcatalog_data_type string data_type varchar(65000) data_type_id 9 data_type_length 65000 character_maximum_length 65000 numeric_precision numeric_scale datetime_precision interval_precision ordinal_position 5 -[ RECORD 6 ]------------+----------------- table_schema hcat hcatalog_schema default table_name hcatalogtypes is_partition_column f column_name boolcol hcatalog_data_type boolean data_type boolean data_type_id 5 data_type_length 1 character_maximum_length numeric_precision numeric_scale datetime_precision interval_precision ordinal_position 6 -[ RECORD 7 ]------------+----------------- table_schema hcat hcatalog_schema default table_name hcatalogtypes is_partition_column f column_name timestampcol hcatalog_data_type string data_type varchar(65000) data_type_id 9 data_type_length 65000 character_maximum_length 65000 numeric_precision numeric_scale datetime_precision interval_precision ordinal_position 7 -[ RECORD 8 ]------------+----------------- table_schema hcat hcatalog_schema default table_name hcatalogtypes is_partition_column f column_name varbincol hcatalog_data_type binary data_type varbinary(65000) data_type_id 17 HPE Vertica Analytic Database (7.2.x) Page 48 of 145
Using the HCatalog Connector data_type_length 65000 character_maximum_length 65000 numeric_precision numeric_scale datetime_precision interval_precision ordinal_position 8 -[ RECORD 9 ]------------+----------------- table_schema hcat hcatalog_schema default table_name hcatalogtypes is_partition_column f column_name bincol hcatalog_data_type binary data_type varbinary(65000) data_type_id 17 data_type_length 65000 character_maximum_length 65000 numeric_precision numeric_scale datetime_precision interval_precision ordinal_position 9 Synchronizing an HCatalog Schema or Table With a Local Schema or Table Querying data from an HCatalog schema can be slow due to Hive and WebHCat performance issues. This slow performance can be especially annoying when you want to examine the structure of the tables in the Hive database. Getting this information from Hive requires you to query the HCatalog schema's metadata using the HCatalog Connector. To avoid this performance problem you can use the SYNC_WITH_HCATALOG_ SCHEMA function to create a snapshot of the HCatalog schema's metadata within a Vertica schema. You supply this function with the name of a pre-existing Vertica schema, typically the one created through CREATE HCATALOG SCHEMA, and a Hive schema available through the HCatalog Connector. The function creates a set of external tables within the Vertica schema that you can then use to examine the structure of the tables in the Hive database. Because the metadata in the Vertica schema is local, query planning is much faster. You can also use standard Vertica statements and system-table queries to examine the structure of Hive tables in the HCatalog schema. Caution: The SYNC_WITH_HCATALOG_SCHEMA function overwrites tables in the Vertica schema whose names match a table in the HCatalog schema. Do not HPE Vertica Analytic Database (7.2.x) Page 49 of 145
Using the HCatalog Connector use the Vertica schema to store other data. When SYNC_WITH_HCATALOG_SCHEMA creates tables in Vertica, it matches Hive's STRING and BINARY types to Vertica's VARCHAR(65000) and VARBINARY(65000) types. You might want to change these lengths, using ALTER TABLE SET DATA TYPE, in two cases: If the value in Hive is larger than 65000 bytes, increase the size and use LONG VARCHAR or LONG VARBINARY to avoid data truncation. If a Hive string uses multi-byte encodings, you must increase the size in Vertica to avoid data truncation. This step is needed because Hive counts string length in characters while Vertica counts it in bytes. If the value in Hive is much smaller than 65000 bytes, reduce the size to conserve memory in Vertica. The Vertica schema is just a snapshot of the HCatalog schema's metadata. Vertica does not synchronize later changes to the HCatalog schema with the local schema after you call SYNC_WITH_HCATALOG_SCHEMA. You can call the function again to resynchronize the local schema to the HCatalog schema. If you altered column data types, you will need to repeat those changes because the function creates new external tables. By default, SYNC_WITH_HCATALOG_SCHEMA does not drop tables that appear in the local schema that do not appear in the HCatalog schema. Thus, after the function call the local schema does not reflect tables that have been dropped in the Hive database since the previous call. You can change this behavior by supplying the optional third Boolean argument that tells the function to drop any table in the local schema that does not correspond to a table in the HCatalog schema. Instead of synchronizing the entire schema, you can synchronize individual tables by using SYNC_WITH_HCATALOG_SCHEMA_TABLE. If the table already exists in Vertica the function overwrites it. If the table is not found in the HCatalog schema, this function returns an error. In all other respects this function behaves in the same way as SYNC_WITH_HCATALOG_SCHEMA. Examples The following example demonstrates calling SYNC_WITH_HCATALOG_SCHEMA to synchronize the HCatalog schema in Vertica with the metadata in Hive. Because it HPE Vertica Analytic Database (7.2.x) Page 50 of 145
Using the HCatalog Connector synchronizes the HCatalog schema directly, instead of synchronizing another schema with the HCatalog schema, both arguments are the same. => CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' HCATALOG_SCHEMA='default' -> HCATALOG_USER='hcatuser'; CREATE SCHEMA => SELECT sync_with_hcatalog_schema('hcat', 'hcat'); sync_with_hcatalog_schema ---------------------------------------- Schema hcat synchronized with hcat tables in hcat = 56 tables altered in hcat = 0 tables created in hcat = 56 stale tables in hcat = 0 table changes erred in hcat = 0 (1 row) => -- Use vsql's \d command to describe a table in the synced schema => \d hcat.messages List of Fields by Tables Schema Table Column Type Size Default Not Null Primary Key Foreign Key -----------+----------+---------+----------------+-------+---------+----------+-------------+------ ------- hcat messages id int 8 f f hcat messages userid varchar(65000) 65000 f f hcat messages "time" varchar(65000) 65000 f f hcat messages message varchar(65000) 65000 f f (4 rows) This example shows synchronizing with a schema created using CREATE HCATALOG SCHEMA. Synchronizing with a schema created using CREATE SCHEMA is also supported. You can query tables in the local schema that you synchronized with an HCatalog schema. However, querying tables in a synchronized schema isn't much faster than directly querying the HCatalog schema, because SYNC_WITH_HCATALOG_SCHEMA only duplicates the HCatalog schema's metadata. The data in the table is still retrieved using the HCatalog Connector, Data Type Conversions from Hive to Vertica The data types recognized by Hive differ from the data types recognized by Vertica. The following table lists how the HCatalog Connector converts Hive data types into data types compatible with Vertica. HPE Vertica Analytic Database (7.2.x) Page 51 of 145
Hive Data Type                  Vertica Data Type
TINYINT (1-byte)                TINYINT (8-bytes)
SMALLINT (2-bytes)              SMALLINT (8-bytes)
INT (4-bytes)                   INT (8-bytes)
BIGINT (8-bytes)                BIGINT (8-bytes)
BOOLEAN                         BOOLEAN
FLOAT (4-bytes)                 FLOAT (8-bytes)
DECIMAL (precision, scale)      DECIMAL (precision, scale)
DOUBLE (8-bytes)                DOUBLE PRECISION (8-bytes)
CHAR (length in characters)     CHAR (length in bytes)
VARCHAR (length in characters)  VARCHAR (length in bytes), if length <= 65000;
                                LONG VARCHAR (length in bytes), if length > 65000
STRING (2 GB max)               VARCHAR (65000)
BINARY (2 GB max)               VARBINARY (65000)
DATE                            DATE
TIMESTAMP                       TIMESTAMP
LIST/ARRAY                      VARCHAR (65000) containing a JSON-format representation of the list
MAP                             VARCHAR (65000) containing a JSON-format representation of the map
STRUCT                          VARCHAR (65000) containing a JSON-format representation of the struct
Data-Width Handling Differences Between Hive and Vertica
The HCatalog Connector relies on Hive SerDe classes to extract data from files on HDFS. Therefore, the data read from these files is subject to Hive's data-width restrictions. For example, suppose the SerDe parses a value for an INT column into a value that is greater than 2^31-1 (the maximum value for a signed 32-bit integer). In this case, the value is rejected even if it would fit into Vertica's 64-bit INTEGER column, because it cannot fit into Hive's 32-bit INT.
Hive measures CHAR and VARCHAR length in characters, and Vertica measures them in bytes. Therefore, if multi-byte encodings are being used (like Unicode), text might be truncated in Vertica.
Once the value has been parsed and converted to a Vertica data type, it is treated as native data. This treatment can result in some confusion when comparing the results of an identical query run in Hive and in Vertica. For example, if your query adds two INT values that result in a value larger than 2^31-1, the value overflows its 32-bit INT data type, causing Hive to return an error. When running the same query with the same data in Vertica using the HCatalog Connector, the value probably still fits within Vertica's 64-bit INTEGER. Thus, the addition succeeds and returns a value.

Using Non-Standard SerDes
Hive stores its data in unstructured flat files located in the Hadoop Distributed File System (HDFS). When you execute a Hive query, it uses a set of serializer and deserializer (SerDe) classes to extract data from these flat files and organize it into a relational database table. For Hive to be able to extract data from a file, it must have a SerDe that can parse the data the file contains. When you create a table in Hive, you can select the SerDe to be used for the table's data.
Hive has a set of standard SerDes that handle data in several formats such as delimited data and data extracted using regular expressions. You can also use third-party or custom-defined SerDes that allow Hive to process data stored in other file formats. For example, some commonly used third-party SerDes handle data stored in JSON format.
The HCatalog Connector directly fetches file segments from HDFS and uses Hive's SerDe classes to extract data from them. The Connector includes all of Hive's standard SerDe classes, so it can process data stored in any file that Hive natively supports. If
Using the HCatalog Connector you want to query data from a Hive table that uses a custom SerDe, you must first install the SerDe classes on the Vertica cluster. Determining Which SerDe You Need If you have access to the Hive command line, you can determine which SerDe a table uses by using Hive's SHOW CREATE TABLE statement. This statement shows the HiveQL statement needed to recreate the table. For example: hive> SHOW CREATE TABLE msgjson; OK CREATE EXTERNAL TABLE msgjson( messageid int COMMENT 'from deserializer', userid string COMMENT 'from deserializer', time string COMMENT 'from deserializer', message string COMMENT 'from deserializer') ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.jsonserde' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.textinputformat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' LOCATION 'hdfs://hivehost.example.com:8020/user/exampleuser/msgjson' TBLPROPERTIES ( 'transient_lastddltime'='1384194521') Time taken: 0.167 seconds In the example, ROW FORMAT SERDE indicates that a special SerDe is used to parse the data files. The next row shows that the class for the SerDe is named org.apache.hadoop.hive.contrib.serde2.jsonserde.you must provide the HCatalog Connector with a copy of this SerDe class so that it can read the data from this table. You can also find out which SerDe class you need by querying the table that uses the custom SerDe. The query will fail with an error message that contains the class name of the SerDe needed to parse the data in the table. In the following example, the portion of the error message that names the missing SerDe class is in bold. => SELECT * FROM hcat.jsontable; ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0 com.vertica.sdk.udfexception: Error message is [ org.apache.hcatalog.common.hcatexception : 2004 : HCatOutputFormat not initialized, setoutput has to be called. Cause : java.io.ioexception: java.lang.runtimeexception: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException SerDe com.cloudera.hive.serde.jsonserde does not exist) ] HINT If error message is not descriptive or local, may be we cannot read metadata from hive metastore service thrift://hcathost:9083 or HDFS namenode (check HPE Vertica Analytic Database (7.2.x) Page 54 of 145
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
at com.vertica.hcatalogudl.hcatalogsplitsnoopsourcefactory.plan(hcatalogsplitsnoopsourcefactory.java:98)
at com.vertica.udxfence.udxexeccontext.planudsource(udxexeccontext.java:898)...

Installing the SerDe on the Vertica Cluster
You usually have two options for getting the SerDe class file the HCatalog Connector needs:
Find the installation files for the SerDe, then copy those over to your Vertica cluster. For example, there are several third-party JSON SerDes available from sites like Google Code and GitHub. You may find the one that matches the file installed on your Hive cluster. If so, then download the package and copy it to your Vertica cluster.
Directly copy the JAR files from a Hive server onto your Vertica cluster. The location for the SerDe JAR files depends on your Hive installation. On some systems, they may be located in /usr/lib/hive/lib.
Wherever you get the files, copy them into the /opt/vertica/packages/hcat/lib directory on every node in your Vertica cluster.
Important: If you add a new host to your Vertica cluster, remember to copy every custom SerDe JAR file to it.

Troubleshooting HCatalog Connector Problems
You may encounter the following issues when using the HCatalog Connector.

Connection Errors
When you use CREATE HCATALOG SCHEMA to create a new schema, the HCatalog Connector does not immediately attempt to connect to the WebHCat or metastore servers. Instead, when you execute a query using the schema or HCatalog-related system tables, the connector attempts to connect to and retrieve data from your Hadoop cluster.
Using the HCatalog Connector The types of errors you get depend on which parameters are incorrect. Suppose you have incorrect parameters for the metastore database, but correct parameters for WebHCat. In this case, HCatalog-related system table queries succeed, while queries on the HCatalog schema fail. The following example demonstrates creating an HCatalog schema with the correct default WebHCat information. However, the port number for the metastore database is incorrect. => CREATE HCATALOG SCHEMA hcat2 WITH hostname='hcathost' -> HCATALOG_SCHEMA='default' HCATALOG_USER='hive' PORT=1234; CREATE SCHEMA => SELECT * FROM HCATALOG_TABLE_LIST; -[ RECORD 1 ]------+--------------------- table_schema_id 45035996273864536 table_schema hcat2 hcatalog_schema default table_name test hcatalog_user_name hive => SELECT * FROM hcat2.test; ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0 com.vertica.sdk.udfexception: Error message is [ org.apache.hcatalog.common.hcatexception : 2004 : HCatOutputFormat not initialized, setoutput has to be called. Cause : java.io.ioexception: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.ttransportexception: java.net.connectexception: Connection refused at org.apache.thrift.transport.tsocket.open(tsocket.java:185) at org.apache.hadoop.hive.metastore.hivemetastoreclient.open( HiveMetaStoreClient.java:277)... To resolve these issues, you must drop the schema and recreate it with the correct parameters. If you still have issues, determine whether there are connectivity issues between your Vertica cluster and your Hadoop cluster. Such issues can include a firewall that prevents one or more Vertica hosts from contacting the WebHCat, metastore, or HDFS hosts. You may also see this error if you are using HA NameNode, particularly with larger tables that HDFS splits into multiple blocks. See Using the HCatalog Connector with HA NameNode for more information about correcting this problem. UDx Failure When Querying Data: Error 3399 You might see an error message when querying data (as opposed to metadata like schema information). This might be accompanied by a ClassNotFoundException in the log. This can happen for the following reasons: HPE Vertica Analytic Database (7.2.x) Page 56 of 145
Using the HCatalog Connector You are not using the same version of Java on your Hadoop and Vertica nodes. In this case you need to change one of them to match the other. You have not used hcatutil to copy all Hadoop and Hive libraries to Vertica, or you ran hcatutil and then changed your version of Hadoop or Hive. You upgraded Vertica to a new version and did not rerun hcatutil and reinstall the HCatalog Connector. The version of Hadoop you are using relies on a third-party library that you must copy manually. You are reading files with LZO compression and have not copied the libraries or set the io.compression.codecs property in core-site.xml. If you did not copy the libraries or configure LZO compression, follow the instructions in Configuring Vertica for HCatalog. If the Hive jars that you copied from Hadoop are out of date, you might see an error message like the following: ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0 Error message is [ Found interface org.apache.hadoop.mapreduce.jobcontext, but class was expected ] HINT hive metastore service is thrift://localhost:13433 (check UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information) This usually signals a problem with hive-hcatalog-core jar. Make sure you have an up-to-date copy of this file. Remember that if you rerun hcatutil you also need to recreate the HCatalog schema. You might also see a different form of this error: ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0 Error message is [ javax/servlet/filter ] This error can be reported even if hcatutil reports that your libraries are up to date. The javax.servlet.filter class is in a library that some versions of Hadoop use but that is not usually part of the Hadoop installation directly. If you see an error mentioning this class, locate servlet-api-*.jar on a Hadoop node and copy it to the hcat/lib directory on all database nodes. If you cannot locate it on a Hadoop node, locate and download it from the Internet. (This case is rare.) The library version must be 2.3 or higher. HPE Vertica Analytic Database (7.2.x) Page 57 of 145
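The following is a hedged sketch of that copy step. The Vertica host names, the dbadmin account, and the jar's source path are assumptions; substitute the values from your environment:

$ # host names, the dbadmin account, and the jar path below are examples only
$ for host in vertica01 vertica02 vertica03; do
>   scp /usr/lib/hadoop/lib/servlet-api-2.5.jar dbadmin@${host}:/opt/vertica/packages/hcat/lib/
> done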
Using the HCatalog Connector Once you have copied the jar to the hcat/lib directory, reinstall the HCatalog connector as explained in Configuring Vertica for HCatalog. SerDe Errors Errors can occur if you attempt to query a Hive table that uses a non-standard SerDe. If you have not installed the SerDe JAR files on your Vertica cluster, you receive an error similar to the one in the following example: => SELECT * FROM hcat.jsontable; ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0 com.vertica.sdk.udfexception: Error message is [ org.apache.hcatalog.common.hcatexception : 2004 : HCatOutputFormat not initialized, setoutput has to be called. Cause : java.io.ioexception: java.lang.runtimeexception: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException SerDe com.cloudera.hive.serde.jsonserde does not exist) ] HINT If error message is not descriptive or local, may be we cannot read metadata from hive metastore service thrift://hcathost:9083 or HDFS namenode (check UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information) at com.vertica.hcatalogudl.hcatalogsplitsnoopsourcefactory.plan(hcatalogsplitsnoopsourcefactory.java:98) at com.vertica.udxfence.udxexeccontext.planudsource(udxexeccontext.java:898)... In the error message, you can see that the root cause is a missing SerDe class (shown in bold). To resolve this issue, install the SerDe class on your Vertica cluster. See Using Non-Standard SerDes for more information. This error may occur intermittently if just one or a few hosts in your cluster do not have the SerDe class. Differing Results Between Hive and Vertica Queries Sometimes, running the same query on Hive and on Vertica through the HCatalog Connector can return different results. This discrepancy is often caused by the differences between the data types supported by Hive and Vertica. See Data Type Conversions from Hive to Vertica for more information about supported data types. If Hive string values are being truncated in Vertica, this might be caused by multi-byte character encodings in Hive. Hive reports string length in characters, while Vertica records it in bytes. For a two-byte encoding such as Unicode, you need to double the column size in Vertica to avoid truncation. HPE Vertica Analytic Database (7.2.x) Page 58 of 145
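For a table created locally with SYNC_WITH_HCATALOG_SCHEMA, a minimal sketch of widening such a column follows; the schema, table, and column names and the new size are hypothetical:

=> ALTER TABLE hcat.messages ALTER COLUMN message SET DATA TYPE LONG VARCHAR(130000); -- names and size are examples

As noted earlier, sizes above 65000 bytes require LONG VARCHAR rather than VARCHAR.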
Discrepancies can also occur if the Hive table uses partition columns of types other than string.

Preventing Excessive Query Delays
Network issues or high system loads on the WebHCat server can cause long delays while querying a Hive database using the HCatalog Connector. While Vertica cannot resolve these issues, you can set parameters that limit how long Vertica waits before canceling a query on an HCatalog schema. You can set these parameters globally using Vertica configuration parameters. You can also set them for specific HCatalog schemas in the CREATE HCATALOG SCHEMA statement. These specific settings override the settings in the configuration parameters.
The HCatConnectionTimeout configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_CONNECTION_TIMEOUT parameter control how many seconds the HCatalog Connector waits for a connection to the WebHCat server. A value of 0 (the default setting for the configuration parameter) means to wait indefinitely. If the WebHCat server does not respond by the time this timeout elapses, the HCatalog Connector breaks the connection and cancels the query. If you find that some queries on an HCatalog schema pause excessively, try setting this parameter to a timeout value, so the query does not hang indefinitely.
The HCatSlowTransferTime configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_TIME parameter specify how long the HCatalog Connector waits for data after making a successful connection to the WebHCat server. After the specified time has elapsed, the HCatalog Connector determines whether the data transfer rate from the WebHCat server is at least the value set in the HCatSlowTransferLimit configuration parameter (or by the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_LIMIT parameter). If it is not, then the HCatalog Connector terminates the connection and cancels the query.
You can set these parameters to cancel queries that run very slowly but do eventually complete. However, query delays are usually caused by a slow connection rather than a problem establishing the connection. Therefore, try adjusting the slow transfer rate settings first. If you find the cause of the issue is connections that never complete, you can alternately adjust the Linux TCP socket timeouts to a suitable value instead of relying solely on the HCatConnectionTimeout parameter.
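As a hedged illustration of the global settings (the database name, the values, and their units are examples only; check the parameter descriptions for the exact units your version expects):

=> ALTER DATABASE mydb SET HCatSlowTransferTime = 60;     -- how long to wait before checking the transfer rate (example value)
=> ALTER DATABASE mydb SET HCatSlowTransferLimit = 65536; -- minimum acceptable transfer rate (example value)
=> ALTER DATABASE mydb SET HCatConnectionTimeout = 30;    -- how long to wait for a WebHCat connection (example value)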
Using the HDFS Connector Using the HDFS Connector The Hadoop Distributed File System (HDFS) is the location where Hadoop usually stores its input and output files. It stores files across the Hadoop cluster redundantly, to keep the files available even if some nodes are down. HDFS also makes Hadoop more efficient, by spreading file access tasks across the cluster to help limit I/O bottlenecks. The HDFS Connector lets you load files from HDFS into Vertica using the COPY statement. You can also create external tables that access data stored on HDFS as if it were a native Vertica table. The connector is useful if your Hadoop job does not directly store its data in Vertica using the MapReduce Connector (see Using the MapReduce Connector) or if you want to use User-Defined Extensions (UDxs) to load data stored in HDFS. Note: The files you load from HDFS using the HDFS Connector usually have a delimited format. Column values are separated by a character, such as a comma or a pipe character ( ). This format is the same type used in other files you load with the COPY statement. Hadoop MapReduce jobs often output tab-delimited files. Like the MapReduce Connector, the HDFS Connector takes advantage of the distributed nature of both Vertica and Hadoop. Individual nodes in the Vertica cluster connect directly to nodes in the Hadoop cluster when you load multiple files from HDFS. Hadoop splits large files into file segments that it stores on different nodes. The connector directly retrieves these file segments from the node storing them, rather than relying on the Hadoop cluster to reassemble the file. The connector is read-only; it cannot write data to HDFS. The HDFS Connector can connect to a Hadoop cluster through unauthenticated and Kerberos-authenticated connections. HDFS Connector Requirements Uninstall Prior Versions of the HDFS Connector The HDFS Connector is now installed with Vertica; you no longer need to download and install it separately. If you have previously downloaded and installed this connector, uninstall it before you upgrade to this release of Vertica to get the newest version. HPE Vertica Analytic Database (7.2.x) Page 60 of 145
Using the HDFS Connector webhdfs Requirements The HDFS Connector connects to the Hadoop file system using webhdfs, a built-in component of HDFS that provides access to HDFS files to applications outside of Hadoop. This component must be enabled on your Hadoop cluster. See your Hadoop distribution's documentation for instructions on configuring and enabling webhdfs. Note: HTTPfs (also known as HOOP) is another method of accessing files stored in an HDFS. It relies on a separate server process that receives requests for files and retrieves them from the HDFS. Since it uses a REST API that is compatible with webhdfs, it could theoretically work with the connector. However, the connector has not been tested with HTTPfs and HPE does not support using the HDFS Connector with HTTPfs. In addition, since all of the files retrieved from HDFS must pass through the HTTPfs server, it is less efficient than webhdfs, which lets Vertica nodes directly connect to the Hadoop nodes storing the file blocks. Kerberos Authentication Requirements The HDFS Connector can connect to HDFS using Kerberos authentication. To use Kerberos, you must meet these additional requirements: Your Vertica installation must be Kerberos-enabled. Your Hadoop cluster must be configured to use Kerberos authentication. Your connector must be able to connect to the Kerberos-enabled Hadoop cluster. The Kerberos server must be running version 5. The Kerberos server must be accessible from every node in your Vertica cluster. You must have Kerberos principals (users) that map to Hadoop users. You use these principals to authenticate your Vertica users with the Hadoop cluster. Testing Your Hadoop webhdfs Configuration To ensure that your Hadoop installation's WebHDFS system is configured and running, follow these steps: HPE Vertica Analytic Database (7.2.x) Page 61 of 145
1. Log into your Hadoop cluster and locate a small text file on the Hadoop filesystem. If you do not have a suitable file, you can create a file named test.txt in the /tmp directory using the following command:

echo -e "A 1 2 3\nB 4 5 6" | hadoop fs -put - /tmp/test.txt

2. Log into a host in your Vertica database using the database administrator account.
3. If you are using Kerberos authentication, authenticate with the Kerberos server using the keytab file for a user who is authorized to access the file. For example, to authenticate as a user named exampleuser@mycompany.com, use the command:

$ kinit exampleuser@mycompany.com -k -t /path/exampleuser.keytab

Where path is the path to the keytab file you copied over to the node. You do not receive any message if you authenticate successfully. You can verify that you are authenticated by using the klist command:

$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: exampleuser@mycompany.com

Valid starting     Expires            Service principal
07/24/13 14:30:19  07/25/13 14:30:19  krbtgt/mycompany.com@mycompany.com
        renew until 07/24/13 14:30:19

4. Test retrieving the file:
If you are not using Kerberos authentication, run the following command from the Linux command line:

curl -i -L "http://hadoopnamenode:50070/webhdfs/v1/tmp/test.txt?op=open&user.name=hadoopusername"

Replace hadoopnamenode with the hostname or IP address of the name node in your Hadoop cluster, /tmp/test.txt with the path to the file in the Hadoop filesystem you located in step 1, and hadoopusername with the user name of a Hadoop user that has read access to the file.
If successful, the command produces output similar to the following:
Using the HDFS Connector HTTP/1.1 200 OKServer: Apache-Coyote/1.1 Set-Cookie: hadoop.auth="u=hadoopuser&p=password&t=simple&e=1344383263490&s=n8yb/chfg56qnmrqrtqo0idrmve ="; Version=1; Path=/ Content-Type: application/octet-stream Content-Length: 16 Date: Tue, 07 Aug 2012 13:47:44 GMT A 1 2 3 B 4 5 6 If you are using Kerberos authentication, run the following command from the Linux command line: curl --negotiate -i -L -u:anyuser http://hadoopnamenode:50070/webhdfs/v1/tmp/test.txt?op=open Replace hadoopnamenode with the hostname or IP address of the name node in your Hadoop cluster, and /tmp/test.txt with the path to the file in the Hadoop filesystem you located in step 1. If successful, the command produces output similar to the following: HTTP/1.1 401 UnauthorizedContent-Type: text/html; charset=utf-8 WWW-Authenticate: Negotiate Content-Length: 0 Server: Jetty(6.1.26) HTTP/1.1 307 TEMPORARY_REDIRECT Content-Type: application/octet-stream Expires: Thu, 01-Jan-1970 00:00:00 GMT Set-Cookie: hadoop.auth="u=exampleuser&p=exampleuser@mycompany.com&t=kerberos& e=1375144834763&s=iy52irvjuuoz5iyg8g5g12o2vwo=";path=/ Location: http://hadoopnamenode.mycompany.com:1006/webhdfs/v1/user/release/docexample/test.txt? op=open&delegation=jaahcmvszwfzzqdyzwxlyxnlaiobqcrfpdgkaubo7cnrju3tbbslid_osb658jfgf RpEt8-u9WHymRJXRUJIREZTIGRlbGVnYXRpb24SMTAuMjAuMTAwLjkxOjUwMDcw&offset=0 Content-Length: 0 Server: Jetty(6.1.26) HTTP/1.1 200 OK Content-Type: application/octet-stream Content-Length: 16 Server: Jetty(6.1.26) A 1 2 3 B 4 5 6 If the curl command fails, you must review the error messages and resolve any issues before using the Vertica Connector for HDFS with your Hadoop cluster. Some debugging steps include: HPE Vertica Analytic Database (7.2.x) Page 63 of 145
Verify the HDFS service's port number.
Verify that the Hadoop user you specified exists and has read access to the file you are attempting to retrieve.

Loading Data Using the HDFS Connector
You can use the HDFS User Defined Source (UDS) in a COPY statement to load data from HDFS files. The syntax for using the HDFS UDS in a COPY statement is:

COPY tablename SOURCE Hdfs(url='WebHDFSFileURL', [username='username'], [low_speed_limit=speed]);

tablename
The name of the table to receive the copied data.

WebHDFSFileURL
A string containing one or more URLs that identify the file or files to be read. See below for details. Use commas to separate multiple URLs. If a URL contains certain special characters, you must escape them:
Replace any commas in the URLs with the escape sequence %2c. For example, if you are loading a file named doe,john.txt, change the file's name in the URL to doe%2cjohn.txt.
Replace any single quotes with the escape sequence '''. For example, if you are loading a file named john's_notes.txt, change the file's name in the URL to john'''s_notes.txt.

username
The username of a Hadoop user that has permissions to access the files you want to copy. If you are using Kerberos, omit this argument.

speed
The minimum data transmission rate, expressed in bytes per second, that the connector allows. The connector breaks any connection between the Hadoop and Vertica clusters that transmits data slower than this rate for more than 1 minute. After the connector breaks a connection for being too slow, it attempts to connect to another node in the Hadoop cluster. This new connection can supply the data that the broken connection was retrieving. The connector terminates the
COPY statement and returns an error message if:
It cannot find another Hadoop node to supply the data.
The previous transfer attempts from all other Hadoop nodes that have the file also closed because they were too slow.
Default Value: 1048576 (1MB per second transmission rate)

The HDFS File URL
The url parameter in the Hdfs function call is a string containing one or more comma-separated HTTP URLs. These URLs identify the files in HDFS that you want to load. The format for each URL in this string is:

http://NameNode:Port/webhdfs/v1/HDFSFilePath

NameNode
The host name or IP address of the Hadoop cluster's name node.

Port
The port number on which the WebHDFS service is running. This number is usually 50070 or 14000, but may be different in your Hadoop installation.

webhdfs/v1/
The protocol being used to retrieve the file. This portion of the URL is always the same. It tells Hadoop to use version 1 of the WebHDFS API.

HDFSFilePath
The path from the root of the HDFS filesystem to the file or files you want to load. This path can contain standard Linux wildcards.
Important: Any wildcards you use to specify multiple input files must resolve to files only. They must not include any directories. For example, if you specify the path /user/hadoopuser/output/*, and the output directory contains a subdirectory, the connector returns an error message.

The following example shows how to use the Vertica Connector for HDFS to load a single file named /tmp/test.txt. The Hadoop cluster's name node is named hadoop.

=> COPY testtable SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/tmp/test.txt', username='hadoopuser');
Using the HDFS Connector Rows Loaded ------------- 2 (1 row) Copying Files in Parallel The basic COPY statement in the previous example copies a single file. It runs on just a single host in the Vertica cluster because the Connector cannot break up the workload among nodes. Any data load that does not take advantage of all nodes in the Vertica cluster is inefficient. To make loading data from HDFS more efficient, spread the data across multiple files on HDFS. This approach is often natural for data you want to load from HDFS. Hadoop MapReduce jobs usually store their output in multiple files. You specify multiple files to be loaded in your Hdfs function call by: Using wildcards in the URL Supplying multiple comma-separated URLs in the url parameter of the Hdfs userdefined source function call Supplying multiple comma-separated URLs that contain wildcards Loading multiple files through the Vertica Connector for HDFS results in a efficient load. The Vertica hosts connect directly to individual nodes in the Hadoop cluster to retrieve files. If Hadoop has broken files into multiple chunks, the Vertica hosts directly connect to the nodes storing each chunk. The following example shows how to load all of the files whose filenames start with "part-" located in the /user/hadoopuser/output directory on the HDFS. If there are at least as many files in this directory as there are nodes in the Vertica cluster, all nodes in the cluster load data from the HDFS. => COPY Customers SOURCE-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/hadoopUser/output/part-*', username='hadoopuser'); Rows Loaded ------------- 40008 (1 row) To load data from multiple directories on HDFS at once use multiple comma-separated URLs in the URL string: HPE Vertica Analytic Database (7.2.x) Page 66 of 145
Using the HDFS Connector => COPY Customers SOURCE-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/HadoopUser/output/part-*, http://hadoop:50070/webhdfs/v1/user/anotheruser/part-*', username='h=hadoopuser'); Rows Loaded ------------- 80016 (1 row) Note: Vertica statements must be less than 65,000 characters long. If you supply too many long URLs in a single statement, you could go over this limit. Normally, you would only approach this limit if you are automatically generating of the COPY statement using a program or script. Viewing Rejected Rows and Exceptions COPY statements that use the Vertica Connector for HDFS use the same method for recording rejections and exceptions as other COPY statements. Rejected rows and exceptions are saved to log files. These log files are stored by default in the CopyErrorLogs subdirectory in the database's catalog directory. Due to the distributed nature of the Vertica Connector for HDFS, you cannot use the ON option to force all exception and rejected row information to be written to log files on a single Vertica host. Instead, you need to collect the log files from across the hosts to review all of the exceptions and rejections generated by the COPY statement. For more about handling rejected rows, see Capturing Load Rejections and Exceptions. Creating an External Table with an HDFS Source You can use the HDFS Connector as a source for an external table that lets you directly perform queries on the contents of files on the Hadoop Distributed File System (HDFS). See Using External Tables in the Administrator's Guide for more information on external tables. If your HDFS data is in ORC or Parquet format, using the special readers for those formats might provide better performance. See Reading Native Hadoop File Formats. Using an external table to access data stored on an HDFS cluster is useful when you need to extract data from files that are periodically updated, or have additional files added on HDFS. It saves you from having to drop previously loaded data and then HPE Vertica Analytic Database (7.2.x) Page 67 of 145
Using the HDFS Connector reload the data using a COPY statement. The external table always accesses the current version of the files on HDFS. Note: An external table performs a bulk load each time it is queried. Its performance is significantly slower than querying an internal Vertica table. You should only use external tables for infrequently-run queries (such as daily reports). If you need to frequently query the content of the HDFS files, you should either use COPY to load the entire content of the files into Vertica or save the results of a query run on an external table to an internal table which you then use for repeated queries. To create an external table that reads data from HDFS, use the HDFS Use-Defined Source (UDS) in a CREATE EXTERNAL TABLE AS COPY statement. The COPY portion of this statement has the same format as the COPY statement used to load data from HDFS. See Loading Data Using the HDFS Connector for more information. The following simple example shows how to create an external table that extracts data from every file in the /user/hadoopuser/example/output directory using the HDFS Connector. => CREATE EXTERNAL TABLE hadoopexample (A VARCHAR(10), B INTEGER, C INTEGER, D INTEGER) -> AS COPY SOURCE Hdfs(url= -> 'http://hadoop01:50070/webhdfs/v1/user/hadoopuser/example/output/*', -> username='hadoopuser'); CREATE TABLE => SELECT * FROM hadoopexample; A B C D -------+---+---+--- test1 1 2 3 test1 3 4 5 (2 rows) Later, after another Hadoop job adds contents to the output directory, querying the table produces different results: => SELECT * FROM hadoopexample; A B C D -------+----+----+---- test3 10 11 12 test3 13 14 15 test2 6 7 8 test2 9 0 10 test1 1 2 3 test1 3 4 5 (6 rows) HPE Vertica Analytic Database (7.2.x) Page 68 of 145
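If you query this HDFS data frequently, one option noted above is to save the results of a query on the external table into an internal Vertica table and run subsequent queries against that. A minimal sketch, where the internal table name is an example:

=> CREATE TABLE hadoopexample_local AS SELECT * FROM hadoopexample; -- 'hadoopexample_local' is an example name

Queries against hadoopexample_local then read Vertica-managed storage instead of going back to HDFS each time.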
Using the HDFS Connector Load Errors in External Tables Normally, querying an external table on HDFS does not produce any errors, even if rows are rejected by the underlying COPY statement (for example, rows containing columns whose contents are incompatible with the data types in the table). Rejected rows are handled the same way they are in a standard COPY statement: they are written to a rejected data file, and are noted in the exceptions file. For more information on how COPY handles rejected rows and exceptions, see Capturing Load Rejections and Exceptions in the Administrator's Guide. Rejection and exception files are created on all of the nodes that load data from the HDFS. You cannot specify a single node to receive all of the rejected row and exception information. These files are created on each Vertica node as they process files loaded through the Vertica Connector for HDFS. Note: Since the connector is read-only, there is no way to store rejection and exception information on the HDFS. Fatal errors during the transfer of data (for example, specifying files that do not exist on the HDFS) do not occur until you query the external table. The following example shows what happens if you recreate the table based on a file that does not exist on HDFS. => DROP TABLE hadoopexample; DROP TABLE => CREATE EXTERNAL TABLE hadoopexample (A INTEGER, B INTEGER, C INTEGER, D INTEGER) -> AS COPY SOURCE HDFS(url='http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt', -> username='hadoopuser'); CREATE TABLE => SELECT * FROM hadoopexample; ERROR 0: Error calling plan() in User Function HdfsFactory at [src/hdfs.cpp:222], error code: 0, message: No files match [http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt] Note that it is not until you actually query the table that the connector attempts to read the file. Only then does it return an error. HDFS Connector Troubleshooting Tips The following sections explain some of the common issues you may encounter when using the HDFS Connector. HPE Vertica Analytic Database (7.2.x) Page 69 of 145
Using the HDFS Connector User Unable to Connect to Kerberos-Authenticated Hadoop Cluster A user may suddenly be unable to connect to Hadoop through the connector in a Kerberos-enabled environment. This issue can be caused by someone exporting a new keytab file for the user, which invalidates existing keytab files. You can determine whether an invalid keytab file is the problem by comparing the key version number associated with the user's principal key in Kerberos with the key version number stored in the keytab file on the Vertica cluster. To find the key version number for a user in Kerberos: 1. From the Linux command line, start the kadmin utility (kadmin.local if you are logged into the Kerberos Key Distribution Center). Run the getprinc command for the user: $ sudo kadmin [sudo] password for dbadmin: Authenticating as principal root/admin@mycompany.com with password. Password for root/admin@mycompany.com: kadmin: getprinc exampleuser@mycompany.com Principal: exampleuser@mycompany.com Expiration date: [never] Last password change: Fri Jul 26 09:40:44 EDT 2013 Password expiration date: [none] Maximum ticket life: 1 day 00:00:00 Maximum renewable life: 0 days 00:00:00 Last modified: Fri Jul 26 09:40:44 EDT 2013 (root/admin@mycompany.com) Last successful authentication: [never] Last failed authentication: [never] Failed password attempts: 0 Number of keys: 2 Key: vno 3, des3-cbc-sha1, no salt Key: vno 3, des-cbc-crc, no salt MKey: vno 0 Attributes: Policy: [none] In the preceding example, there are two keys stored for the user, both of which are at version number (vno) 3. 2. To get the version numbers of the keys stored in the keytab file, use the klist command: $ sudo klist -ek exampleuser.keytab Keytab name: FILE:exampleuser.keytab KVNO Principal ---- ---------------------------------------------------------------------- HPE Vertica Analytic Database (7.2.x) Page 70 of 145
Using the HDFS Connector 2 exampleuser@mycompany.com (des3-cbc-sha1) 2 exampleuser@mycompany.com (des-cbc-crc) 3 exampleuser@mycompany.com (des3-cbc-sha1) 3 exampleuser@mycompany.com (des-cbc-crc) The first column in the output lists the key version number. In the preceding example, the keytab includes both key versions 2 and 3, so the keytab file can be used to authenticate the user with Kerberos. Resolving Error 5118 When using the connector, you might receive an error message similar to the following: ERROR 5118: UDL specified no execution nodes; at least one execution node must be specified To correct this error, verify that all of the nodes in your Vertica cluster have the correct version of the HDFS Connector package installed. This error can occur if one or more of the nodes do not have the supporting libraries installed. These libraries may be missing because one of the nodes was skipped when initially installing the connector package. Another possibility is that one or more nodes have been added since the connector was installed. Transfer Rate Errors The HDFS Connector monitors how quickly Hadoop sends data to Vertica. In some cases, the data transfer speed on a connection between a node in your Hadoop cluster and a node in your Vertica cluster falls below a lower limit (by default, 1 MB per second). When the transfer rate falls below this limit, the connector breaks the data transfer. It then connects to another node in the Hadoop cluster that contains the data it was retrieving. If it cannot find another node in the Hadoop cluster to supply the data (or has already tried all of the nodes in the Hadoop cluster), the connector terminates the COPY statement and returns an error. => COPY messages SOURCE Hdfs(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt', username='exampleuser'); ERROR 3399: Failure in UDx RPC call InvokeProcessUDL(): Error calling processudl() in User Defined Object [Hdfs] at [src/hdfs.cpp:275], error code: 0, message: [Transferring rate during last 60 seconds is 172655 byte/s, below threshold 1048576 byte/s, give up. The last error message: Operation too slow. Less than 1048576 bytes/sec transferred the last 1 seconds. The URL: http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt?op=open&offset=154901544&length=113533912. The redirected URL: http://hadoop.example.com:50075/webhdfs/v1/tmp/data.txt?op=open& HPE Vertica Analytic Database (7.2.x) Page 71 of 145
Using the HDFS Connector namenoderpcaddress=hadoop.example.com:8020&length=113533912&offset=154901544.] If you encounter this error, troubleshoot the connection between your Vertica and Hadoop clusters. If there are no problems with the network, determine if either your Hadoop cluster or Vertica cluster is overloaded. If the nodes in either cluster are too busy, they may not be able to maintain the minimum data transfer rate. If you cannot resolve the issue causing the slow transfer rate, you can lower the minimum acceptable speed. To do so, set the low_speed_limit parameter for the Hdfs source. The following example shows how to set low_speed_limit to 524288 to accept transfer rates as low as 512 KB per second (half the default lower limit). => COPY messages SOURCE Hdfs(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt', username='exampleuser', low_speed_limit=524288); Rows Loaded ------------- 9891287 (1 row) When you lower the low_speed_limit parameter, the COPY statement loading data from HDFS may take a long time to complete. You can also increase the low_speed_limit setting if the network between your Hadoop cluster and Vertica cluster is fast. You can choose to increase the lower limit to force COPY statements to generate an error if they run more slowly than they should, given the speed of the network. Error Loading Many Files When using the HDFS Connector to load many data files in a single statement, you might receive an error message similar to the following: ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error calling planudl() in User Defined Object [Hdfs] at [src/glob.cpp:531], error code: 0, message: Error occurs in Glob::stat: Last error message before give up: Failed to connect to 10.20.41.212: Cannot assign requested address. This can happen when concurrent load requests overwhelm the Name Node. It is generally safe to load hundreds of files at a time, but if you load thousands you might see this error. Use smaller batches of files to avoid this error. HPE Vertica Analytic Database (7.2.x) Page 72 of 145
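One way to keep each batch small is to split a large load into several COPY statements whose URL globs each match a subset of the files. The statements below are only a sketch; the directory and part-file naming pattern are assumptions based on the earlier examples, so adjust the globs to match your own file names.

=> COPY messages SOURCE Hdfs(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/logs/part-000*',
->                           username='exampleuser');
=> COPY messages SOURCE Hdfs(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/logs/part-001*',
->                           username='exampleuser');
=> COPY messages SOURCE Hdfs(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/logs/part-002*',
->                           username='exampleuser');

Each statement matches only a subset of the files, which keeps the number of simultaneous requests sent to the Name Node manageable.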
Using HDFS Storage Locations Using HDFS Storage Locations The Vertica Storage Location for HDFS lets Vertica store its data in a Hadoop Distributed File System (HDFS) similarly to how it stores data on a native Linux filesystem. It lets you create a storage tier for lower-priority data to free space on your Vertica cluster for higher-priority data. For example, suppose you store website clickstream data in your Vertica database. You may find that most queries only examine the last six months of this data. However, there are a few low-priority queries that still examine data older than six months. In this case, you could choose to move the older data to an HDFS storage location so that it is still available for the infrequent queries. The queries on the older data are slower because they now access data stored on HDFS rather than native disks. However, you free space on your Vertica cluster's storage for higher-priority, frequently-queried data. Storage Location for HDFS Requirements To store Vertica's data on HDFS, verify that: Your Hadoop cluster has WebHDFS enabled. All of the nodes in your Vertica cluster can connect to all of the nodes in your Hadoop cluster. Any firewall between the two clusters must allow connections on the ports used by HDFS. See Testing Your Hadoop webhdfs Configuration for a procedure to test the connectivity between your Vertica and Hadoop clusters. You have a Hadoop user whose username matches the name of the Vertica database administrator (usually named dbadmin). This Hadoop user must have read and write access to the HDFS directory where you want Vertica to store its data. Your HDFS has enough storage available for Vertica data. See HDFS Space Requirements below for details. The data you store in an HDFS-backed storage location does not expand your database's size beyond any data allowance in your Vertica license. Vertica counts data stored in an HDFS-backed storage location as part of any data allowance set by your license. See Managing Licenses in the Administrator's Guide for more information. HPE Vertica Analytic Database (7.2.x) Page 73 of 145
Using HDFS Storage Locations If you are using an HDFS storage location with Kerberos, you must have Kerberos running and the principals defined before creating the storage location. See Create the Principals and Keytabs for instructions on defining the principals. HDFS Space Requirements If your Vertica database is K-safe, HDFS-based storage locations contain two copies of the data you store in them. One copy is the primary projection, and the other is the buddy projection. If you have enabled HDFS's data redundancy feature, Hadoop stores both projections multiple times. This duplication may seem excessive. However, it is similar to how a RAID level 1 or higher redundantly stores copies of both Vertica's primary and buddy projections. The redundant copies also help the performance of HDFS by enabling multiple nodes to process a request for a file. Verify that your HDFS installation has sufficient space available for redundant storage of both the primary and buddy projections of your K-safe data. You can adjust the number of duplicates stored by HDFS by setting the HadoopFSReplication configuration parameter. See Troubleshooting HDFS Storage Locations for details. Additional Requirements for Backing Up Data Stored on HDFS In Premium Edition, to back up your data stored in HDFS storage locations, your Hadoop cluster must: Have HDFS 2.0 or later installed. The vbr backup utility uses the snapshot feature introduced in HDFS 2.0. Have snapshotting enabled for the directories to be used for backups. The easiest way to do this is to give the database administrator's account superuser privileges in Hadoop, so that snapshotting can be set automatically. Alternatively, use Hadoop to enable snapshotting for each directory before using it for backups. In addition, your Vertica database must: Have enough Hadoop components and libraries installed in order to run the Hadoop distcp command as the Vertica database-administrator user (usually dbadmin). HPE Vertica Analytic Database (7.2.x) Page 74 of 145
Using HDFS Storage Locations Have the JavaBinaryForUDx and HadoopHome configuration parameters set correctly. Caution: After you have created an HDFS storage location, full database backups will fail with the error message: ERROR 5127: Unable to create snapshot No such file /usr/bin/hadoop: check the HadoopHome configuration parameter This error is caused by the backup script not being able to back up the HDFS storage locations. You must configure Vertica and Hadoop to enable the backup script to back these locations. After you configure Vertica and Hadoop, you can once again perform full database backups. See Backing Up HDFS Storage Locations for details on configuring your Vertica and Hadoop clusters to enable HDFS storage location backup. How the HDFS Storage Location Stores Data The Vertica Storage Location for HDFS stores data on the Hadoop HDFS similarly to the way Vertica stores data in the Linux file system. See Managing Storage Locations in the Administrator's Guide for more information about storage locations. When you create a storage location on HDFS, Vertica stores the ROS containers holding its data on HDFS. You can choose which data uses the HDFS storage location: from the data for just a single table to all of the database's data. When Vertica reads data from or writes data to an HDFS storage location, the node storing or retrieving the data contacts the Hadoop cluster directly to transfer the data. If a single ROS container file is split among several Hadoop nodes, the Vertica node connects to each of them. The Vertica node retrieves the pieces and reassembles the file. By having each node fetch its own data directly from the source, data transfers are parallel, increasing their efficiency. Having the Vertica nodes directly retrieve the file splits also reduces the impact on the Hadoop cluster. What You Can Store on HDFS Use HDFS storage locations to store only data. You cannot store catalog information in an HDFS storage location. HPE Vertica Analytic Database (7.2.x) Page 75 of 145
Using HDFS Storage Locations Caution: While it is possible to use an HDFS storage location for temporary data storage, you must never do so. Using HDFS for temporary storage causes severe performance issues. The only time you change an HDFS storage location's usage to temporary is when you are in the process of removing it. What HDFS Storage Locations Cannot Do Because Vertica uses the storage locations to store ROS containers in a proprietary format, MapReduce and other Hadoop components cannot access your Vertica data stored in HDFS. Never allow another program that has access to HDFS to write to the ROS files. Any outside modification of these files can lead to data corruption and loss. Use the Vertica Connector for Hadoop MapReduce if you need your MapReduce job to access Vertica data. Other applications must use the Vertica client libraries to access Vertica data. The storage location stores and reads only ROS containers. It cannot read data stored in native formats in HDFS. If you want Vertica to read data from HDFS, use the Vertica Connector for HDFS. If the data you want to access is available in a Hive database, you can use the Vertica Connector for HCatalog. Creating an HDFS Storage Location Before creating an HDFS storage location, you must first create a Hadoop user who can access the data: If your HDFS cluster is unsecured, create a Hadoop user whose username matches the user name of the Vertica database administrator account. For example, suppose your database administrator account has the default username dbadmin. You must create a Hadoop user account named dbadmin and give it full read and write access to the directory on HDFS to store files. If your HDFS cluster uses Kerberos authentication, create a Kerberos principal for Vertica and give it read and write access to the HDFS directory that will be used for the storage location. See Configuring Kerberos. Consult the documentation for your Hadoop distribution to learn how to create a user and grant the user read and write permissions for a directory in HDFS. HPE Vertica Analytic Database (7.2.x) Page 76 of 145
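On most distributions, granting the directory permissions comes down to a few HDFS shell commands run as an HDFS superuser. The following is a minimal sketch; the /user/dbadmin path and the dbadmin user and group names are assumptions that you should adjust for your environment.

$ hdfs dfs -mkdir -p /user/dbadmin
$ hdfs dfs -chown dbadmin:dbadmin /user/dbadmin
$ hdfs dfs -chmod 750 /user/dbadmin

Creating the Hadoop user account itself (or, for Kerberos, the principal) remains distribution-specific; see your Hadoop documentation.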
Using HDFS Storage Locations Use the CREATE LOCATION statement to create an HDFS storage location. To do so, you must: Supply the WebHDFS URI for the HDFS directory where you want Vertica to store the location's data as the path argument. This URI is the same as a standard HDFS URL, except that it uses the webhdfs:// protocol and its path does not start with /webhdfs/v1/. Include the ALL NODES SHARED keywords, as all HDFS storage locations are shared storage. This is required even if you have only one HDFS node in your cluster. The following example demonstrates creating an HDFS storage location that: Is located on the Hadoop cluster whose name node's host name is hadoop. Stores its files in the /user/dbadmin directory. Is labeled coldstorage. The example also demonstrates querying the STORAGE_LOCATIONS system table to verify that the storage location was created. => CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' ALL NODES SHARED USAGE 'data' LABEL 'coldstorage'; CREATE LOCATION => SELECT node_name,location_path,location_label FROM STORAGE_LOCATIONS; node_name location_path location_label ------------------+------------------------------------------------------+---------------- v_vmart_node0001 /home/dbadmin/vmart/v_vmart_node0001_data v_vmart_node0001 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 coldstorage v_vmart_node0002 /home/dbadmin/vmart/v_vmart_node0002_data v_vmart_node0002 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 coldstorage v_vmart_node0003 /home/dbadmin/vmart/v_vmart_node0003_data v_vmart_node0003 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 coldstorage (6 rows) Each node in the cluster has created its own directory under the dbadmin directory in HDFS. These individual directories prevent the nodes from interfering with each other's files in the shared location. HPE Vertica Analytic Database (7.2.x) Page 77 of 145
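You can also confirm from the Hadoop side that the per-node directories were created. A quick check, assuming the /user/dbadmin path used in the example above:

$ hdfs dfs -ls /user/dbadmin

The listing should show one subdirectory per Vertica node, each owned by the database administrator's Hadoop account.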
Using HDFS Storage Locations Creating a Storage Location Using Vertica for SQL on Hadoop If you are using the Premium Edition product, then you typically use HDFS storage locations for lower-priority data as shown in the previous example. If you are using the Vertica for SQL on Hadoop product, however, all of your data must be stored in HDFS. To create an HDFS storage location that complies with the Vertica for SQL on Hadoop license, first create the location on all nodes and then set its storage policy to HDFS. To create the location in HDFS on all nodes: => CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' ALL NODES SHARED USAGE 'data' LABEL 'HDFS'; Next, set the storage policy for your database to use this location: => SELECT set_object_storage_policy('dbname','HDFS'); This causes all data to be written to the HDFS storage location instead of the local disk. For more information, see "Best Practices for SQL on Hadoop" in Managing Storage Locations. Adding HDFS Storage Locations to New Nodes Any nodes you add to your cluster do not have access to existing HDFS storage locations. You must manually create the storage location for the new node using the CREATE LOCATION statement. Do not use the ALL NODES keyword in this statement. Instead, use the NODE keyword with the name of the new node to tell Vertica that just that node needs to add the shared location. Caution: You must manually create the storage location. Otherwise, the new node uses the default storage policy (usually, storage on the local Linux filesystem) to store data that the other nodes store in HDFS. As a result, the node can run out of disk space. The following example shows how to add the storage location from the preceding example to a new node named v_vmart_node0004: => CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' NODE 'v_vmart_node0004' SHARED USAGE 'data' LABEL 'coldstorage'; HPE Vertica Analytic Database (7.2.x) Page 78 of 145
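After the statement completes, you can confirm that the new node has its own instance of the shared location by querying the STORAGE_LOCATIONS system table, just as in the earlier example. A minimal sketch, assuming the node name and label from the preceding statement:

=> SELECT node_name, location_path, location_label FROM STORAGE_LOCATIONS
   WHERE node_name = 'v_vmart_node0004' AND location_label = 'coldstorage';

The query should return a single row showing the webhdfs:// path for the new node.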
Using HDFS Storage Locations Any active standby nodes in your cluster when you create an HDFS-based storage location automatically create their own instances of the location. When the standby node takes over for a down node, it uses its own instance of the location to store data for objects using the HDFS-based storage policy. Treat standby nodes added after you create the storage location as any other new node. You must manually define the HDFS storage location. Creating a Storage Policy for HDFS Storage Locations After you create an HDFS storage location, you assign database objects to the location by setting storage policies. Based on these storage policies, database objects such as partition ranges, individual tables, whole schemas, or even the entire database store their data in the HDFS storage location. Use the SET_OBJECT_STORAGE_POLICY function to assign objects to an HDFS storage location. In the function call, supply the label you assigned to the HDFS storage location (using the CREATE LOCATION statement's LABEL keyword) as the location label argument. The following topics provide examples of storing data on HDFS. Storing an Entire Table in an HDFS Storage Location The following example demonstrates using SET_OBJECT_STORAGE_POLICY to store a table in an HDFS storage location. The example statement sets the policy for an existing table, named messages, to store its data in an HDFS storage location, named coldstorage. => SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage'); This table's data is moved to the HDFS storage location with the next mergeout. Alternatively, you can have Vertica move the data immediately by using the enforce_storage_move parameter. You can query the STORAGE_CONTAINERS system table and examine the location_label column to verify that Vertica has moved the data: => SELECT node_name, projection_name, location_label, total_row_count FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'messages%'; HPE Vertica Analytic Database (7.2.x) Page 79 of 145
Using HDFS Storage Locations node_name projection_name location_label total_row_count ------------------+-----------------+----------------+----------------- v_vmart_node0001 messages_b0 coldstorage 366057 v_vmart_node0001 messages_b1 coldstorage 366511 v_vmart_node0002 messages_b0 coldstorage 367432 v_vmart_node0002 messages_b1 coldstorage 366057 v_vmart_node0003 messages_b0 coldstorage 366511 v_vmart_node0003 messages_b1 coldstorage 367432 (6 rows) See Creating Storage Policies in the Administrator's Guide for more information about assigning storage policies to objects. Storing Table Partitions in HDFS If the data you want to store in an HDFS-based storage location is in a partitioned table, you can choose to store some of the partitions in HDFS. This capability lets you to periodically move old data that is queried less frequently off of more costly higher-speed storage (such as on a solid- state drive). You can instead use slower and less expensive HDFS storage. The older data is still accessible in queries, just at a slower speed. In this scenario, the faster storage is often referred to as "hot storage," and the slower storage is referred to as "cold storage." For example, suppose you have a table named messages containing social media messages that is partitioned by the year and month of the message's timestamp. You can list the partitions in the table by querying the PARTITIONS system table. => SELECT partition_key, projection_name, node_name, location_label FROM partitions ORDER BY partition_key; partition_key projection_name node_name location_label --------------+-----------------+------------------+---------------- 201309 messages_b1 v_vmart_node0001 201309 messages_b0 v_vmart_node0003 201309 messages_b1 v_vmart_node0002 201309 messages_b1 v_vmart_node0003 201309 messages_b0 v_vmart_node0001 201309 messages_b0 v_vmart_node0002 201310 messages_b0 v_vmart_node0002 201310 messages_b1 v_vmart_node0003 201310 messages_b0 v_vmart_node0001... 201405 messages_b0 v_vmart_node0002 201405 messages_b1 v_vmart_node0003 201405 messages_b1 v_vmart_node0001 201405 messages_b0 v_vmart_node0001 (54 rows) Next, suppose you find that most queries on this table access only the latest month or two of data. You may decide to move the older data to cold storage in an HDFS-based HPE Vertica Analytic Database (7.2.x) Page 80 of 145
Using HDFS Storage Locations storage location. After you move the data, it is still available for queries, but with lower query performance. To move partitions to the HDFS storage location, supply the lowest and highest partition key values to be moved in the SET_OBJECT_STORAGE_POLICY function call. The following example shows how to move data between two dates to an HDFSbased storage location. In this example: Partition key value 201309 represents September 2013. Partition key value 201403 represents March 2014. The name, coldstorage, is the label of the HDFS-based storage location. => SELECT SET_OBJECT_STORAGE_POLICY('messages','coldstorage', '201309', '201403' USING PARAMETERS ENFORCE_STORAGE_MOVE = 'true'); After the statement finishes, the range of partitions now appear in the HDFS storage location labeled coldstorage. This location name now displays in the PARTITIONS system table's location_label column. => SELECT partition_key, projection_name, node_name, location_label FROM partitions ORDER BY partition_key; partition_key projection_name node_name location_label --------------+-----------------+------------------+---------------- 201309 messages_b0 v_vmart_node0003 coldstorage 201309 messages_b1 v_vmart_node0001 coldstorage 201309 messages_b1 v_vmart_node0002 coldstorage 201309 messages_b0 v_vmart_node0001 coldstorage... 201403 messages_b0 v_vmart_node0002 coldstorage 201404 messages_b0 v_vmart_node0001 201404 messages_b0 v_vmart_node0002 201404 messages_b1 v_vmart_node0001 201404 messages_b1 v_vmart_node0002 201404 messages_b0 v_vmart_node0003 201404 messages_b1 v_vmart_node0003 201405 messages_b0 v_vmart_node0001 201405 messages_b1 v_vmart_node0002 201405 messages_b0 v_vmart_node0002 201405 messages_b0 v_vmart_node0003 201405 messages_b1 v_vmart_node0001 201405 messages_b1 v_vmart_node0003 (54 rows) After your initial data move, you can move additional data to the HDFS storage location periodically. You move individual partitions or a range of partitions from the "hot" storage to the "cold" storage location using the same method: HPE Vertica Analytic Database (7.2.x) Page 81 of 145
Using HDFS Storage Locations => SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage', '201404', '201404' USING PARAMETERS ENFORCE_STORAGE_MOVE = 'true'); SET_OBJECT_STORAGE_POLICY ---------------------------- Object storage policy set. (1 row) => SELECT projection_name, node_name, location_label FROM PARTITIONS WHERE PARTITION_KEY = '201404'; projection_name node_name location_label -----------------+------------------+---------------- messages_b0 v_vmart_node0002 coldstorage messages_b0 v_vmart_node0003 coldstorage messages_b1 v_vmart_node0003 coldstorage messages_b0 v_vmart_node0001 coldstorage messages_b1 v_vmart_node0002 coldstorage messages_b1 v_vmart_node0001 coldstorage (6 rows) Moving Partitions to a Table Stored on HDFS Another method of moving partitions from hot storage to cold storage is to move the partition's data to a separate table that is stored on HDFS. This method breaks the data into two tables, one containing hot data and the other containing cold data. Use this method if you want to prevent queries from inadvertently accessing data stored in the slower HDFS storage location. To query the older data, you must explicitly query the cold table. To move partitions: 1. Create a new table whose schema matches that of the existing partitioned table. 2. Set the storage policy of the new table to use the HDFS-based storage location. 3. Use the MOVE_PARTITIONS_TO_TABLE function to move a range of partitions from the hot table to the cold table. The following example demonstrates these steps. You first create a table named cold_ messages. You then assign it the HDFS-based storage location named coldstorage, and, finally, move a range of partitions. => CREATE TABLE cold_messages LIKE messages INCLUDING PROJECTIONS; => SELECT SET_OBJECT_STORAGE_POLICY('cold_messages', 'coldstorage'); => SELECT MOVE_PARTITIONS_TO_TABLE('messages','201309','201403','cold_messages'); Note: The partitions moved using this method do not immediately migrate to the storage location on HDFS. Instead, the Tuple Mover eventually moves them to the HPE Vertica Analytic Database (7.2.x) Page 82 of 145
Using HDFS Storage Locations storage location. Backing Up Vertica Storage Locations for HDFS Note: The backup and restore features are available only in the Premium Edition product, not in Vertica for SQL on Hadoop. HP recommends that you regularly back up the data in your Vertica database. This recommendation includes data stored in your HDFS storage locations. The Vertica backup script (vbr) can back up HDFS storage locations. However, you must perform several configuration steps before it can back up these locations. Caution: After you have created an HDFS storage location, full database backups will fail with the error message: ERROR 5127: Unable to create snapshot No such file /usr/bin/hadoop: check the HadoopHome configuration parameter This error is caused by the backup script not being able to back up the HDFS storage locations. You must configure Vertica and Hadoop to enable the backup script to back these locations. After you configure Vertica and Hadoop, you can once again perform full database backups. There are several considerations for backing up HDFS storage locations in your database: The HDFS storage location backup feature relies on the snapshotting feature introduced in HDFS 2.0. You cannot back up an HDFS storage location stored on an earlier version of HDFS. HDFS storage locations do not support object-level backups. You must perform a full database backup in order to back up the data in your HDFS storage locations. Data in an HDFS storage location is backed up to HDFS. This backup guards against accidental deletion or corruption of data. It does not prevent data loss in the case of a catastrophic failure of the entire Hadoop cluster. To prevent data loss, you must have a backup and disaster recovery plan for your Hadoop cluster. HPE Vertica Analytic Database (7.2.x) Page 83 of 145
Using HDFS Storage Locations Data stored on the Linux native filesystem is still backed up to the location you specify in the backup configuration file. It and the data in HDFS storage locations are handled separately by the vbr backup script. You must configure your Vertica cluster in order to restore database backups containing an HDFS storage location. See Configuring Vertica to Back Up HDFS Storage Locations for the configuration steps you must take. The HDFS directory for the storage location must have snapshotting enabled. You can either directly configure this yourself or enable the database administrator's Hadoop account to do it for you automatically. See Configuring Hadoop to Enable Backup of HDFS Storage for more information. The topics in this section explain the configuration steps you must take to enable the backup of HDFS storage locations. Configuring Vertica to Restore HDFS Storage Locations Your Vertica cluster must be able to run the Hadoop distcp command to restore a backup of an HDFS storage location. The easiest way to enable your cluster to run this command is to install several Hadoop packages on each node. These packages must be from the same distribution and version of Hadoop that is running on your Hadoop cluster. The steps you need to take depend on: The distribution and version of Hadoop running on the Hadoop cluster containing your HDFS storage location. The distribution of Linux running on your Vertica cluster. Note: Installing the Hadoop packages necessary to run distcp does not turn your Vertica database into a Hadoop cluster. This process installs just enough of the Hadoop support files on your cluster to run the distcp command. There is no additional overhead placed on the Vertica cluster, aside from a small amount of additional disk space consumed by the Hadoop support files. HPE Vertica Analytic Database (7.2.x) Page 84 of 145
Using HDFS Storage Locations Configuration Overview The steps for configuring your Vertica cluster to restore backups for an HDFS storage location are: 1. If necessary, install and configure a Java runtime on the hosts in the Vertica cluster. 2. Find the location of your Hadoop distribution's package repository. 3. Add the Hadoop distribution's package repository to the Linux package manager on all hosts in your cluster. 4. Install the necessary Hadoop packages on your Vertica hosts. 5. Set two configuration parameters in your Vertica database related to Java and Hadoop. 6. If your HDFS storage location uses Kerberos, set additional configuration parameters to allow Vertica user credentials to be proxied. 7. Confirm that the Hadoop distcp command runs on your Vertica hosts. The following sections describe these steps in greater detail. Installing a Java Runtime Your Vertica cluster must have a Java Virtual Machine (JVM) installed to run the Hadoop distcp command. It already has a JVM installed if you have configured it to: Execute User-Defined Extensions developed in Java. See Developing User Defined Extensions for more information. Access Hadoop data using the HCatalog Connector. See Using the HCatalog Connector for more information. If your Vertica database does have a JVM installed, you must verify that your Hadoop distribution supports it. See your Hadoop distribution's documentation to determine which JVMs it supports. If the JVM installed on your Vertica cluster is not supported by your Hadoop distribution, you must uninstall it. Then you must install a JVM that is supported by both Vertica and HPE Vertica Analytic Database (7.2.x) Page 85 of 145
Using HDFS Storage Locations your Hadoop distribution. See Vertica SDKs in Supported Platforms for a list of the JVMs compatible with Vertica. If your Vertica cluster does not have a JVM (or its existing JVM is incompatible with your Hadoop distribution), follow the instructions in Installing the Java Runtime on Your Vertica Cluster. Finding Your Hadoop Distribution's Package Repository Many Hadoop distributions have their own installation system, such as Cloudera Manager or Hortonworks Ambari. However, they also support manual installation using native Linux packages such as RPM and .deb files. These package files are maintained in a repository. You can configure your Vertica hosts to access this repository to download and install Hadoop packages. Consult your Hadoop distribution's documentation to find the location of its Linux package repository. This information is often located in the portion of the documentation covering manual installation techniques. For example: The Hortonworks Version 2.1 topic on Configuring the Remote Repositories. The "Steps to Install CDH 5 Manually" section of the Cloudera Version 5.1.0 topic Installing CDH 5. Each Hadoop distribution maintains separate repositories for each of the major Linux package management systems. Find the specific repository for the Linux distribution running on your Vertica cluster. Be sure that the package repository that you select matches the version of the Hadoop distribution installed on your Hadoop cluster. Configuring Vertica Nodes to Access the Hadoop Distribution's Package Repository Configure the nodes in your Vertica cluster so they can access your Hadoop distribution's package repository. Your Hadoop distribution's documentation should explain how to add the repositories to your Linux platform. If the documentation does not explain how to add the repository to your packaging system, refer to your Linux distribution's documentation. The steps you need to take depend on the package management system your Linux platform uses. Usually, the process involves: HPE Vertica Analytic Database (7.2.x) Page 86 of 145
Using HDFS Storage Locations Downloading a configuration file. Adding the configuration file to the package management system's configuration directory. For Debian-based Linux distributions, adding the Hadoop repository encryption key to the root account keyring. Updating the package management system's index to have it discover new packages. The following example demonstrates adding the Hortonworks 2.1 package repository to an Ubuntu 12.04 host. These steps in this example are explained in the Hortonworks documentation. $ wget http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/hdp.list \ -O /etc/apt/sources.list.d/hdp.list --2014-08-20 11:06:00-- http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/hdp.list Connecting to 16.113.84.10:8080... connected. Proxy request sent, awaiting response... 200 OK Length: 161 [binary/octet-stream] Saving to: `/etc/apt/sources.list.d/hdp.list' 100%[======================================>] 161 --.-K/s in 0s 2014-08-20 11:06:00 (8.00 MB/s) - `/etc/apt/sources.list.d/hdp.list' saved [161/161] $ gpg --keyserver pgp.mit.edu --recv-keys B9733A7A07513CAD gpg: requesting key 07513CAD from hkp server pgp.mit.edu gpg: /root/.gnupg/trustdb.gpg: trustdb created gpg: key 07513CAD: public key "Jenkins (HDP Builds) <jenkin@hortonworks.com>" imported gpg: Total number processed: 1 gpg: imported: 1 (RSA: 1) $ gpg -a --export 07513CAD apt-key add - OK $ apt-get update Hit http://us.archive.ubuntu.com precise Release.gpg Hit http://extras.ubuntu.com precise Release.gpg Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B] Hit http://us.archive.ubuntu.com precise-updates Release.gpg Get:2 http://public-repo-1.hortonworks.com HDP-UTILS Release.gpg [836 B] Get:3 http://public-repo-1.hortonworks.com HDP Release.gpg [836 B] Hit http://us.archive.ubuntu.com precise-backports Release.gpg Hit http://extras.ubuntu.com precise Release Get:4 http://security.ubuntu.com precise-security Release [50.7 kb] Get:5 http://public-repo-1.hortonworks.com HDP-UTILS Release [6,550 B] Hit http://us.archive.ubuntu.com precise Release Hit http://extras.ubuntu.com precise/main Sources Get:6 http://public-repo-1.hortonworks.com HDP Release [6,502 B] Hit http://us.archive.ubuntu.com precise-updates Release HPE Vertica Analytic Database (7.2.x) Page 87 of 145
Using HDFS Storage Locations Get:7 http://public-repo-1.hortonworks.com HDP-UTILS/main amd64 Packages [1,955 B] Get:8 http://security.ubuntu.com precise-security/main Sources [108 kb] Get:9 http://public-repo-1.hortonworks.com HDP-UTILS/main i386 Packages [762 B]... Reading package lists... Done You must add the Hadoop repository to all hosts in your Vertica cluster. Installing the Required Hadoop Packages After configuring the repository, you are ready to install the Hadoop packages. The packages you need to install are: hadoop hadoop-hdfs hadoop-client The names of the packages are usually the same across all Hadoop and Linux distributions.these packages often have additional dependencies. Always accept any additional packages that the Linux package manager asks to install. To install these packages, use the package manager command for your Linux distribution. The package manager command you need to use depends on your Linux distribution: On Red Hat and CentOS, the package manager command is yum. On Debian and Ubuntu, the package manager command is apt-get. On SUSE the package manager command is zypper. Consult your Linux distribution's documentation for instructions on installing packages. The following example demonstrates installing the required Hadoop packages from the Hortonworks 2.1 distribution on an Ubuntu 12.04 system. # apt-get install hadoop hadoop-hdfs hadoop-client Reading package lists... Done Building dependency tree Reading state information... Done The following extra packages will be installed: bigtop-jsvc hadoop-mapreduce hadoop-yarn zookeeper The following NEW packages will be installed: bigtop-jsvc hadoop hadoop-client hadoop-hdfs hadoop-mapreduce hadoop-yarn HPE Vertica Analytic Database (7.2.x) Page 88 of 145
Using HDFS Storage Locations zookeeper 0 upgraded, 7 newly installed, 0 to remove and 90 not upgraded. Need to get 86.6 MB of archives. After this operation, 99.8 MB of additional disk space will be used. Do you want to continue [Y/n]? Y Get:1 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main bigtop-jsvc amd64 1.0.10-1 [28.5 kb] Get:2 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main zookeeper all 3.4.5.2.1.3.0-563 [6,820 kb] Get:3 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main hadoop all 2.4.0.2.1.3.0-563 [21.5 MB] Get:4 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main hadoop-hdfs all 2.4.0.2.1.3.0-563 [16.0 MB] Get:5 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main hadoop-yarn all 2.4.0.2.1.3.0-563 [15.1 MB] Get:6 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main hadoop-mapreduce all 2.4.0.2.1.3.0-563 [27.2 MB] Get:7 http://public-repo-1.hortonworks.com/hdp/ubuntu12/2.1.3.0/ HDP/main hadoop-client all 2.4.0.2.1.3.0-563 [3,650 B] Fetched 86.6 MB in 1min 2s (1,396 kb/s) Selecting previously unselected package bigtop-jsvc. (Reading database... 197894 files and directories currently installed.) Unpacking bigtop-jsvc (from.../bigtop-jsvc_1.0.10-1_amd64.deb)... Selecting previously unselected package zookeeper. Unpacking zookeeper (from.../zookeeper_3.4.5.2.1.3.0-563_all.deb)... Selecting previously unselected package hadoop. Unpacking hadoop (from.../hadoop_2.4.0.2.1.3.0-563_all.deb)... Selecting previously unselected package hadoop-hdfs. Unpacking hadoop-hdfs (from.../hadoop-hdfs_2.4.0.2.1.3.0-563_all.deb)... Selecting previously unselected package hadoop-yarn. Unpacking hadoop-yarn (from.../hadoop-yarn_2.4.0.2.1.3.0-563_all.deb)... Selecting previously unselected package hadoop-mapreduce. Unpacking hadoop-mapreduce (from.../hadoop-mapreduce_2.4.0.2.1.3.0-563_all.deb)... Selecting previously unselected package hadoop-client. Unpacking hadoop-client (from.../hadoop-client_2.4.0.2.1.3.0-563_all.deb)... Processing triggers for man-db... Setting up bigtop-jsvc (1.0.10-1)... Setting up zookeeper (3.4.5.2.1.3.0-563)... update-alternatives: using /etc/zookeeper/conf.dist to provide /etc/zookeeper/conf (zookeeper-conf) in auto mode. Setting up hadoop (2.4.0.2.1.3.0-563)... update-alternatives: using /etc/hadoop/conf.empty to provide /etc/hadoop/conf (hadoop-conf) in auto mode. Setting up hadoop-hdfs (2.4.0.2.1.3.0-563)... Setting up hadoop-yarn (2.4.0.2.1.3.0-563)... Setting up hadoop-mapreduce (2.4.0.2.1.3.0-563)... Setting up hadoop-client (2.4.0.2.1.3.0-563)... Processing triggers for libc-bin... ldconfig deferred processing now taking place Setting Configuration Parameters You must set two configuration parameters to enable Vertica to restore HDFS data: HPE Vertica Analytic Database (7.2.x) Page 89 of 145
Using HDFS Storage Locations JavaBinaryForUDx is the path to the Java executable. You may have already set this value to use Java UDxs or the HCatalog Connector. You can find the path for the default Java executable from the Bash command shell using the command: which java HadoopHome is the path where Hadoop is installed on the Vertica hosts. This is the directory that contains bin/hadoop (the bin directory containing the Hadoop executable file). The default value for this parameter is /usr. The default value is correct if your Hadoop executable is located at /usr/bin/hadoop. The following example demonstrates setting and then reviewing the values of these parameters. => ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java'; => SELECT get_config_parameter('javabinaryforudx'); get_config_parameter ---------------------- /usr/bin/java (1 row) => ALTER DATABASE mydb SET HadoopHome = '/usr'; => SELECT get_config_parameter('hadoophome'); get_config_parameter ---------------------- /usr (1 row) There are additional parameters you may, optionally, set: HadoopFSReadRetryTimeout and HadoopFSWriteRetryTimeout specify how long to wait before failing. The default value for each is 180 seconds, the Hadoop default. If you are confident that your file system will fail more quickly, you can potentially improve performance by lowering these values. HadoopFSReplication is the number of replicas HDFS makes. By default the Hadoop client chooses this; Vertica uses the same value for all nodes. We recommend against changing this unless directed to. HadoopFSBlockSizeBytes is the block size to write to HDFS; larger files are divided into blocks of this size. The default is 64MB. HPE Vertica Analytic Database (7.2.x) Page 90 of 145
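If you do decide to change any of these optional parameters, you set them the same way as the required parameters above. The following statements are illustrative only; the values shown (a 60-second retry timeout and a 128 MB block size) are examples, not recommendations.

=> ALTER DATABASE mydb SET HadoopFSReadRetryTimeout = 60;
=> ALTER DATABASE mydb SET HadoopFSWriteRetryTimeout = 60;
=> ALTER DATABASE mydb SET HadoopFSBlockSizeBytes = 134217728;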
Using HDFS Storage Locations Setting Kerberos Parameters If your Vertica nodes are co-located on HDFS nodes and you are using Kerberos, you must change some Hadoop configuration parameters. These changes are needed in order for restoring from backups to work. In yarn-site.xml on every Vertica node, set the following parameters:

Parameter                                                        Value
yarn.resourcemanager.proxy-user-privileges.enabled               true
yarn.resourcemanager.proxyusers.*.groups                         *
yarn.resourcemanager.proxyusers.*.hosts                          *
yarn.resourcemanager.proxyusers.*.users                          *
yarn.timeline-service.http-authentication.proxyusers.*.groups    *
yarn.timeline-service.http-authentication.proxyusers.*.hosts     *
yarn.timeline-service.http-authentication.proxyusers.*.users     *

No changes are needed on HDFS nodes that are not also Vertica nodes. Confirming that distcp Runs Once the packages are installed on all hosts in your cluster, your database should be able to run the Hadoop distcp command. To test it: 1. Log into any host in your cluster as the database administrator. 2. At the Bash shell, enter the command: $ hadoop distcp 3. The command should print a message similar to the following: usage: distcp OPTIONS [source_path...] <target_path> OPTIONS -async Should distcp execution be blocking -atomic Commit all changes or none -bandwidth <arg> Specify bandwidth per map in MB HPE Vertica Analytic Database (7.2.x) Page 91 of 145
Using HDFS Storage Locations -delete Delete from target, files missing in source -f <arg> List of files that need to be copied -filelimit <arg> (Deprecated!) Limit number of files copied to <= n -i Ignore failures during copy -log <arg> Folder on DFS where distcp execution logs are saved -m <arg> Max number of concurrent maps to use for copy -mapredsslconf <arg> Configuration for ssl config file, to use with hftps:// -overwrite Choose to overwrite target files unconditionally, even if they exist. -p <arg> preserve status (rbugpc)(replication, block-size, user, group, permission, checksum-type) -sizelimit <arg> (Deprecated!) Limit number of files copied to <= n bytes -skipcrccheck Whether to skip CRC checks between source and target paths. -strategy <arg> Copy strategy to use. Default is dividing work based on file sizes -tmp <arg> Intermediate work path to be used for atomic commit -update Update target, copying only missingfiles or directories 4. Repeat these steps on the other hosts in your database to ensure all of the hosts can run distcp. Troubleshooting If you cannot run the distcp command, try the following steps: If Bash cannot find the hadoop command, you may need to manually add Hadoop's bin directory to the system search path. An alternative is to create a symbolic link in an existing directory in the search path (such as /usr/bin) to the hadoop binary. Ensure the version of Java installed on your Vertica cluster is compatible with your Hadoop distribution. Review the Linux package installation tool's logs for errors. In some cases, packages may not be fully installed, or may not have been downloaded due to network issues. Ensure that the database administrator account has permission to execute the hadoop command. You may need to add the account to a specific group in order to allow it to run the necessary commands. HPE Vertica Analytic Database (7.2.x) Page 92 of 145
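For example, if the hadoop binary is installed outside the default search path, either of the following approaches makes it visible to the database administrator's shell. The /usr/lib/hadoop/bin path is an assumption; substitute the installation directory used by your distribution.

$ export PATH=$PATH:/usr/lib/hadoop/bin
$ sudo ln -s /usr/lib/hadoop/bin/hadoop /usr/bin/hadoop

If you add the directory to the search path, also add the export line to the database administrator's ~/.bashrc so that it persists across sessions, and verify that HadoopHome still points to the directory containing bin/hadoop.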
Using HDFS Storage Locations Configuring Hadoop and Vertica to Enable Backup of HDFS Storage The Vertica backup script uses HDFS's snapshotting feature to create a backup of HDFS storage locations. A directory must allow snapshotting before HDFS can take a snapshot. Only a Hadoop superuser can enable snapshotting on a directory. Vertica can enable snapshotting automatically if the database administrator is also a Hadoop superuser. If HDFS is unsecured, the following instructions apply to the database administrator account, usually dbadmin. If HDFS uses Kerberos security, the following instructions apply to the principal stored in the Vertica keytab file, usually vertica. The instructions below use the term "database account" to refer to this user. We recommend that you make the database administrator or principal a Hadoop superuser. If you are not able to do so, you must enable snapshotting on the directory before configuring it for use by Vertica. The steps you need to take to make the Vertica database administrator account a superuser depend on the distribution of Hadoop you are using. Consult your Hadoop distribution's documentation for details. Instructions for two distributions are provided here. Granting Superuser Status on Hortonworks 2.1 To make the database account a Hadoop superuser: 1. Log into your Hadoop cluster's Hortonworks Hue web user interface. If your Hortonworks cluster uses Ambari or you do not have a web-based user interface, see the Hortonworks documentation for information on granting privileges to users. 2. Click the User Admin icon. 3. In the Hue Users page, click the database account's username. 4. Click the Step 3: Advanced tab. 5. Select Superuser status. HPE Vertica Analytic Database (7.2.x) Page 93 of 145
Using HDFS Storage Locations Granting Superuser Status on Cloudera 5.1 Cloudera Hadoop treats Linux users that are members of the group named supergroup as superusers. Cloudera Manager does not automatically create this group. Cloudera also does not create a Linux user for each Hadoop user. To create a Linux account for the database account and assign the supergroup to it: 1. Log into your Hadoop cluster's NameNode as root. 2. Use the groupadd command to add a group named supergroup. 3. Cloudera does not automatically create a Linux user that corresponds to the database administrator's Hadoop account. If the Linux system does not have a user for your database account you must create it. Use the adduser command to create this user. 4. Use the usermod command to add the database account to supergroup. 5. Verify that the database account is now a member of supergroup using the groups command. 6. Repeat steps 1 through 5 for any other NameNodes in your Hadoop cluster. The following example demonstrates following these steps to grant the database administrator superuser status. # adduser dbadmin # groupadd supergroup # usermod -a -G supergroup dbadmin # groups dbadmin dbadmin : dbadmin supergroup Consult the Linux distribution installed on your Hadoop cluster for more information on managing users and groups. Manually Enabling Snapshotting for a Directory If you cannot grant superuser status to the database account, you can instead enable snapshotting of each directory manually. Use the following command: hdfs dfsadmin -allowsnapshot path Issue this command for each directory on each node. Remember to do this each time you add a new node to your HDFS cluster. HPE Vertica Analytic Database (7.2.x) Page 94 of 145
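Because each Vertica node typically has its own subdirectory under the storage location path, you usually need to enable snapshotting on several directories. A small shell loop can help; the /user/dbadmin path and the node directory names below are assumptions based on the examples earlier in this chapter.

$ for dir in /user/dbadmin/v_vmart_node0001 /user/dbadmin/v_vmart_node0002 /user/dbadmin/v_vmart_node0003
> do
>   hdfs dfsadmin -allowsnapshot $dir
> done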
Using HDFS Storage Locations Nested snapshottable directories are not allowed, so you cannot enable snapshotting for a parent directory to automatically enable it for child directories. You must enable it for each individual directory. Additional Requirements for Kerberos If HDFS uses Kerberos, then in addition to granting the keytab principal access, you must set a Vertica configuration parameter. In Vertica, set the HadoopConfDir parameter to the location of the directory containing the core-site.xml, hdfs-site.xml, and yarn-site.xml configuration files: => ALTER DATABASE exampledb SET HadoopConfDir = '/hadoop'; All three configuration files must be present in this directory. If your Vertica nodes are not co-located on HDFS nodes, then you must copy these files from an HDFS node to each Vertica node. Use the same path on every database node, because HadoopConfDir is a global value. Testing the Database Account's Ability to Make HDFS Directories Snapshottable After making the database account a Hadoop superuser, you should verify that the account can set directories snapshottable: 1. Log into the Hadoop cluster as the database account (dbadmin by default). 2. Determine a location in HDFS where the database administrator can create a directory. The /tmp directory is usually available. Create a test HDFS directory using the command: hdfs dfs -mkdir /path/testdir 3. Make the test directory snapshottable using the command: hdfs dfsadmin -allowsnapshot /path/testdir The following example demonstrates creating an HDFS directory and making it snapshottable: $ hdfs dfs -mkdir /tmp/snaptest $ hdfs dfsadmin -allowsnapshot /tmp/snaptest Allowing snaphot on /tmp/snaptest succeeded HPE Vertica Analytic Database (7.2.x) Page 95 of 145
Using HDFS Storage Locations Performing Backups Containing HDFS Storage Locations After you configure Hadoop and Vertica, HDFS storage locations are automatically backed up when you perform a full database backup. If you already have a backup configuration file for a full database backup, you do not need to make any changes to it. You just run the vbr backup script as usual to perform the full database backup. See Creating Full and Incremental Backups in the Administrator's Guide for instructions on running the vbr backup script. If you do not have a backup configuration file for a full database backup, you must create one to back up the data in your HDFS storage locations. See Creating vbr Configuration Files in the Administrator's Guide for more information. Removing HDFS Storage Locations The steps to remove an HDFS storage location are similar to those for standard storage locations: 1. Remove any existing data from the HDFS storage location. 2. Change the location's usage to TEMP. 3. Retire the location on each host that has the storage location defined by using RETIRE_LOCATION. You can use the enforce_storage_move parameter to make the change immediately, or wait for the Tuple Mover to perform its next moveout. 4. Drop the location on each host that has the storage location defined by using DROP_LOCATION. 5. Optionally remove the snapshots and files from the HDFS directory for the storage location. The following sections explain each of these steps in detail. Important: If you have backed up the data in the HDFS storage location you are removing, you must perform a full database backup after you remove the location. If you do not, restoring the database from a backup made before you removed the location also restores the location's data. HPE Vertica Analytic Database (7.2.x) Page 96 of 145
Using HDFS Storage Locations Removing Existing Data from an HDFS Storage Location You cannot drop a storage location that contains data or is used by any storage policy. You have several options to remove data and storage policies: Drop all of the objects (tables or schemas) that store data in the location. This is the simplest option. However, you can only use this method if you no longer need the data stored in the HDFS storage location. Change the storage policies of objects stored on HDFS to another storage location. When you alter the storage policy, you force all of the data in HDFS location to move to the new location. This option requires that you have an alternate storage location available. Clear the storage policies of all objects that store data on the storage location. You then move the location's data through a process of retiring it. The following sections explain the last two options in greater detail. Moving Data to Another Storage Location You can move data off of an HDFS storage location by altering the storage policies of the objects that use the location. Use the SET_OBJECT_STORAGE_POLICY function to change each object's storage location. If you set this function's third argument to true, it moves the data off of the storage location before returning. The following example demonstrates moving the table named test from the hdfs2 storage location to another location named ssd. => SELECT node_name, projection_name, location_label, total_row_count FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%'; node_name projection_name location_label total_row_count ------------------+-----------------+----------------+----------------- v_vmart_node0001 test_b1 hdfs2 333631 v_vmart_node0001 test_b0 hdfs2 332233 v_vmart_node0001 test_b0 hdfs2 332233 v_vmart_node0001 test_b1 hdfs2 333631 v_vmart_node0003 test_b1 hdfs2 334136 v_vmart_node0003 test_b0 hdfs2 333631 v_vmart_node0003 test_b0 hdfs2 333631 v_vmart_node0003 test_b1 hdfs2 334136 v_vmart_node0002 test_b1 hdfs2 332233 v_vmart_node0002 test_b0 hdfs2 334136 v_vmart_node0002 test_b0 hdfs2 334136 HPE Vertica Analytic Database (7.2.x) Page 97 of 145
Using HDFS Storage Locations v_vmart_node0002 test_b1 hdfs2 332233 (12 rows) => select set_object_storage_policy('test','ssd', true); set_object_storage_policy -------------------------------------------------- Object storage policy set. Task: moving storages (Table: public.test) (Projection: public.test_b0) (Table: public.test) (Projection: public.test_b1) (1 row) => SELECT node_name, projection_name, location_label, total_row_count FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%'; node_name projection_name location_label total_row_count ------------------+-----------------+----------------+----------------- v_vmart_node0001 test_b0 ssd 332233 v_vmart_node0001 test_b0 ssd 332233 v_vmart_node0001 test_b1 ssd 333631 v_vmart_node0001 test_b1 ssd 333631 v_vmart_node0002 test_b0 ssd 334136 v_vmart_node0002 test_b0 ssd 334136 v_vmart_node0002 test_b1 ssd 332233 v_vmart_node0002 test_b1 ssd 332233 v_vmart_node0003 test_b0 ssd 333631 v_vmart_node0003 test_b0 ssd 333631 v_vmart_node0003 test_b1 ssd 334136 v_vmart_node0003 test_b1 ssd 334136 (12 rows) Once you have moved all of the data in the storage location, you are ready to proceed to the next step of removing the storage location. Clearing Storage Policies Another option to move data off of a storage location is to clear the storage policy of each object storing data in the location. You clear an object's storage policy using the CLEAR_OBJECT_STORAGE_POLICY function. Once you clear the storage policy, the Tuple Mover eventually migrates the object's data from the storage location to the database's default storage location. The TM moves the data when it performs a move storage operation. This operation runs infrequently at low priority. Therefore, it may be some time before the data migrates out of the storage location. You can speed up the data migration process by: 1. Calling the RETIRE_LOCATION function to retire the storage location on each host that defines it. HPE Vertica Analytic Database (7.2.x) Page 98 of 145
Using HDFS Storage Locations 2. Calling the MOVE_RETIRED_LOCATION_DATA function to move the location's data to the database's default storage location. 3. Calling the RESTORE_LOCATION function to restore the location on each host that defines it. You must perform this step because you cannot drop retired storage locations. The following example demonstrates clearing the object storage policy of a table stored on HDFS, then performing the steps to move the data off of the location. => SELECT * FROM storage_policies; schema_name object_name policy_details location_label -------------+-------------+----------------+---------------- public test Table hdfs2 (1 row) => SELECT clear_object_storage_policy('test'); clear_object_storage_policy -------------------------------- Object storage policy cleared. (1 row) => SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001'); retire_location --------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 retired. (1 row) => SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002', 'v_vmart_node0002'); retire_location --------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 retired. (1 row) => SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003', 'v_vmart_node0003'); retire_location --------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 retired. (1 row) => SELECT node_name, projection_name, location_label, total_row_count FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%'; node_name projection_name location_label total_row_count ------------------+-----------------+----------------+----------------- v_vmart_node0001 test_b1 hdfs2 333631 v_vmart_node0001 test_b0 hdfs2 332233 v_vmart_node0002 test_b1 hdfs2 332233 v_vmart_node0002 test_b0 hdfs2 334136 v_vmart_node0003 test_b1 hdfs2 334136 v_vmart_node0003 test_b0 hdfs2 333631 (6 rows) HPE Vertica Analytic Database (7.2.x) Page 99 of 145
Using HDFS Storage Locations => SELECT move_retired_location_data(); move_retired_location_data ----------------------------------------------- Move data off retired storage locations done (1 row) => SELECT node_name, projection_name, location_label, total_row_count FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%'; node_name projection_name location_label total_row_count ------------------+-----------------+----------------+----------------- v_vmart_node0001 test_b0 332233 v_vmart_node0001 test_b1 333631 v_vmart_node0002 test_b0 334136 v_vmart_node0002 test_b1 332233 v_vmart_node0003 test_b0 333631 v_vmart_node0003 test_b1 334136 (6 rows) => SELECT restore_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001'); restore_location ---------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 restored. (1 row) => SELECT restore_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002', 'v_vmart_node0002'); restore_location ---------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 restored. (1 row) => SELECT restore_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003', 'v_vmart_node0003'); restore_location ---------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 restored. (1 row) Changing the Usage of HDFS Storage Locations You cannot drop a storage location that allows the storage of data files (ROS containers). Before you can drop an HDFS storage location, you must change its usage from DATA to TEMP using the ALTER_LOCATION_USE function. Make this change on every host in the cluster that defines the storage location. Important: HPE recommends that you do not use HDFS storage locations for temporary file storage. Only set HDFS storage locations to allow temporary file storage as part of the removal process. HPE Vertica Analytic Database (7.2.x) Page 100 of 145
Using HDFS Storage Locations The following example demonstrates using the ALTER_LOCATION_USE function to change the HDFS storage location to temporary file storage. The example calls the function three times: once for each node in the cluster that defines the location. => SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001','temp'); ALTER_LOCATION_USE --------------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 usage changed. (1 row) => SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002', 'v_vmart_node0002','temp'); ALTER_LOCATION_USE --------------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 usage changed. (1 row) => SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003', 'v_vmart_node0003','temp'); ALTER_LOCATION_USE --------------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 usage changed. (1 row) Dropping an HDFS Storage Location After removing all data and changing the data usage of an HDFS storage location, you can drop it. Use the DROP_LOCATION function to drop the storage location from each host that defines it. The following example demonstrates dropping an HDFS storage location from a threenode Vertica database. => SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001'); DROP_LOCATION --------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 dropped. (1 row) => SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002', 'v_vmart_node0002'); DROP_LOCATION --------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 dropped. (1 row) => SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003', 'v_vmart_node0003'); DROP_LOCATION --------------------------------------------------------------- webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 dropped. (1 row) HPE Vertica Analytic Database (7.2.x) Page 101 of 145
Using HDFS Storage Locations

Removing Storage Location Files from HDFS

Dropping an HDFS storage location does not automatically clean the HDFS directory that stored the location's files. Any snapshots of the data files created when backing up the location are also not deleted. These files consume disk space on HDFS and also prevent the directory from being reused as an HDFS storage location. Vertica refuses to create a storage location in a directory that contains existing files or subdirectories.

You must log into the Hadoop cluster to delete the files from HDFS. An alternative is to use some other HDFS file management tool.

Removing Backup Snapshots

HDFS returns an error if you attempt to remove a directory that has snapshots:

$ hdfs dfs -rm -r -f -skipTrash /user/dbadmin/v_vmart_node0001
rm: The directory /user/dbadmin/v_vmart_node0001 cannot be deleted since
/user/dbadmin/v_vmart_node0001 is snapshottable and already has snapshots

The Vertica backup script creates snapshots of HDFS storage locations as part of the backup process. See Backing Up HDFS Storage Locations for more information. If you made backups of your HDFS storage location, you must delete the snapshots before removing the directories.

HDFS stores snapshots in a subdirectory named .snapshot. You can list the snapshots in the directory using the standard HDFS ls command. The following example demonstrates listing the snapshots defined for node0001.

$ hdfs dfs -ls /user/dbadmin/v_vmart_node0001/.snapshot
Found 1 items
drwxrwx---   - dbadmin supergroup          0 2014-09-02 10:13 /user/dbadmin/v_vmart_node0001/.snapshot/s20140902-101358.629

To remove a snapshot, use the command:

hdfs dfs -deleteSnapshot directory snapshotname

The following example demonstrates the command to delete the snapshot shown in the previous example:

$ hdfs dfs -deleteSnapshot /user/dbadmin/v_vmart_node0001 s20140902-101358.629

You must delete each snapshot from the directory for each host in the cluster. Once you have deleted the snapshots, you can delete the directories in the storage location.

Important: Each snapshot's name is based on a timestamp down to the millisecond. Nodes independently create their own snapshots. They do not synchronize snapshot creation, so their snapshot names differ. You must list each node's snapshot directory to learn the names of the snapshots it contains.

HPE Vertica Analytic Database (7.2.x) Page 102 of 145

See Apache's HDFS Snapshot documentation for more information about managing and removing snapshots.

Removing the Storage Location Directories

You can remove the directories that held the storage location's data by either of the following methods:

Use an HDFS file manager to delete directories. See your Hadoop distribution's documentation to determine if it provides a file manager.

Log into the Hadoop NameNode using the database administrator's account and use HDFS's rmr command to delete the directories. See Apache's File System Shell Guide for more information.

The following example uses the HDFS rmr command from the Linux command line to delete the directories left behind in the HDFS storage location directory /user/dbadmin. It uses the -skipTrash flag to force the immediate deletion of the files.

$ hdfs dfs -ls /user/dbadmin
Found 3 items
drwxrwx---   - dbadmin supergroup          0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0001
drwxrwx---   - dbadmin supergroup          0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0002
drwxrwx---   - dbadmin supergroup          0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0003

$ hdfs dfs -rmr -skipTrash /user/dbadmin/*
Deleted /user/dbadmin/v_vmart_node0001
Deleted /user/dbadmin/v_vmart_node0002
Deleted /user/dbadmin/v_vmart_node0003

Troubleshooting HDFS Storage Locations

This topic explains some common issues with HDFS storage locations.

HDFS Storage Disk Consumption

By default, HDFS makes three copies of each file it stores. This replication helps prevent data loss due to disk or system failure. It also helps increase performance by allowing several nodes to handle a request for a file.

HPE Vertica Analytic Database (7.2.x) Page 103 of 145

A Vertica database with a K-Safety value of 1 or greater also stores its data redundantly using buddy projections. When a K-Safe Vertica database stores data in an HDFS storage location, its data redundancy is compounded by HDFS's redundancy: HDFS stores three copies of the primary projection's data, plus three copies of the buddy projection's data, for a total of six copies of the data.

If you want to reduce the amount of disk storage used by HDFS locations, you can alter the number of copies of data that HDFS stores. The Vertica configuration parameter named HadoopFSReplication controls the number of copies of data HDFS stores.

You can determine the current HDFS disk usage by logging into the Hadoop NameNode and issuing the command:

hdfs dfsadmin -report

This command prints the usage for the entire HDFS storage, followed by details for each node in the Hadoop cluster. The following example shows the beginning of the output from this command:

$ hdfs dfsadmin -report
Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32087212032 (29.88 GB)
DFS Remaining: 31565144064 (29.40 GB)
DFS Used: 522067968 (497.88 MB)
DFS Used%: 1.63%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
...

After loading a million-row table into a table stored in an HDFS storage location, the report shows greater disk usage:

Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32085299338 (29.88 GB)
DFS Remaining: 31373565952 (29.22 GB)
DFS Used: 711733386 (678.76 MB)
DFS Used%: 2.22%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
...

The following Vertica example demonstrates:

HPE Vertica Analytic Database (7.2.x) Page 104 of 145
Using HDFS Storage Locations 1. Dropping the table in Vertica. 2. Setting the HadoopFSReplication configuration option to 1. This tells HDFS to store a single copy of an HDFS storage location's data. 3. Recreating the table and reloading its data. => DROP TABLE messages; DROP TABLE => ALTER DATABASE mydb SET HadoopFSReplication = 1; => CREATE TABLE messages (id INTEGER, text VARCHAR); CREATE TABLE => SELECT SET_OBJECT_STORAGE_POLICY('messages', 'hdfs'); SET_OBJECT_STORAGE_POLICY ---------------------------- Object storage policy set. (1 row) => COPY messages FROM '/home/dbadmin/messages.txt' DIRECT; Rows Loaded ------------- 1000000 Running the HDFS report on Hadoop now shows less disk space use: $ hdfs dfsadmin -report Configured Capacity: 51495516981 (47.96 GB) Present Capacity: 32086278190 (29.88 GB) DFS Remaining: 31500988416 (29.34 GB) DFS Used: 585289774 (558.18 MB) DFS Used%: 1.82% Under replicated blocks: 0 Blocks with corrupt replicas: 0 Missing blocks: 0... Caution: Reducing the number of copies of data stored by HDFS increases the risk of data loss. It can also negatively impact the performance of HDFS by reducing the number of nodes that can provide access to a file. This slower performance can impact the performance of Vertica queries that involve data stored in an HDFS storage location. HPE Vertica Analytic Database (7.2.x) Page 105 of 145
Using HDFS Storage Locations

Kerberos Authentication When Creating a Storage Location

If HDFS uses Kerberos authentication, then the CREATE LOCATION statement authenticates using the Vertica keytab principal, not the principal of the user performing the action. If the creation fails with an authentication error, verify that you have followed the steps described in Configuring Kerberos to configure this principal.

When creating an HDFS storage location on a Hadoop cluster using Kerberos, CREATE LOCATION reports the principal being used, as in the following example:

=> CREATE LOCATION 'webhdfs://hadoop.example.com:50070/user/dbadmin' ALL NODES SHARED
   USAGE 'data' LABEL 'coldstorage';
NOTICE 0: Performing HDFS operations using kerberos principal [vertica/hadoop.example.com]
CREATE LOCATION

Backup or Restore Fails When Using Kerberos

When backing up an HDFS storage location that uses Kerberos, you might see an error such as:

createSnapshot: Failed on local exception: java.io.IOException:
java.lang.IllegalArgumentException: Server has invalid Kerberos principal:
hdfs/test.example.com@example.com;

When restoring an HDFS storage location that uses Kerberos, you might see an error such as:

Error msg: Initialization thread logged exception: Distcp failure!

Either of these failures means that Vertica could not find the required configuration files in the HadoopConfDir directory. Usually this is because you have set the parameter but not copied the files from an HDFS node to your Vertica node. See "Additional Requirements for Kerberos" in Configuring Hadoop and Vertica to Enable Backup of HDFS Storage.

HPE Vertica Analytic Database (7.2.x) Page 106 of 145
Using the MapReduce Connector Using the MapReduce Connector The Vertica Connector for Hadoop MapReduce lets you create Hadoop MapReduce jobs that can read data from and write data to Vertica. You commonly use it when: You need to incorporate data from Vertica into your MapReduce job. For example, suppose you are using Hadoop's MapReduce to process web server logs. You may want to access sentiment analysis data stored in Vertica using Pulse to try to correlate a website visitor with social media activity. You are using Hadoop MapReduce to refine data on which you want to perform analytics. You can have your MapReduce job directly insert data into Vertica where you can analyze it in real time using all of Vertica's features. MapReduce Connector Features The MapReduce Connector: gives Hadoop access to data stored in Vertica. lets Hadoop store its results in Vertica. The Connector can create a table for the Hadoop data if it does not already exist. lets applications written in Apache Pig access and store data in Vertica. works with Hadoop streaming. The Connector runs on each node in the Hadoop cluster, so the Hadoop nodes and Vertica nodes communicate with each other directly. Direct connections allow data to be transferred in parallel, dramatically increasing processing speed. The Connector is written in Java, and is compatible with all platforms supported by Hadoop. Note: To prevent Hadoop from potentially inserting multiple copies of data into Vertica, the Vertica Connector for Hadoop Map Reduce disables Hadoop's speculative execution feature. HPE Vertica Analytic Database (7.2.x) Page 107 of 145
Using the MapReduce Connector Prerequisites Before you can use the Vertica Connector for Hadoop MapReduce, you must install and configure Hadoop and be familiar with developing Hadoop applications. For details on installing and using Hadoop, please see the Apache Hadoop Web site. See Vertica 7.2.x Supported Platforms for a list of the versions of Hadoop and Pig that the connector supports. Hadoop and Vertica Cluster Scaling When using the connector for MapReduce, nodes in the Hadoop cluster connect directly to Vertica nodes when retrieving or storing data. These direct connections allow the two clusters to transfer large volumes of data in parallel. If the Hadoop cluster is larger than the Vertica cluster, this parallel data transfer can negatively impact the performance of the Vertica database. To avoid performance impacts on your Vertica database, ensure that your Hadoop cluster cannot overwhelm your Vertica cluster. The exact sizing of each cluster depends on how fast your Hadoop cluster generates data requests and the load placed on the Vertica database by queries from other sources. A good rule of thumb to follow is for your Hadoop cluster to be no larger than your Vertica cluster. Installing the Connector Follow these steps to install the MapReduce Connector: If you have not already done so, download the Vertica Connector for Hadoop Map Reduce installation package from the myvertica portal. Be sure to download the package that is compatible with your version of Hadoop. You can find your Hadoop version by issuing the following command on a Hadoop node: # hadoop version You will also need a copy of the Vertica JDBC driver which you can also download from the myvertica portal. You need to perform the following steps on each node in your Hadoop cluster: HPE Vertica Analytic Database (7.2.x) Page 108 of 145
Using the MapReduce Connector 1. Copy the Vertica Connector for Hadoop Map Reduce.zip archive you downloaded to a temporary location on the Hadoop node. 2. Copy the Vertica JDBC driver.jar file to the same location on your node. If you haven't already, you can download this driver from the myvertica portal. 3. Unzip the connector.zip archive into a temporary directory. On Linux, you usually use the command unzip. 4. Locate the Hadoop home directory (the directory where Hadoop is installed). The location of this directory depends on how you installed Hadoop (manual install versus a package supplied by your Linux distribution or Cloudera). If you do not know the location of this directory, you can try the following steps: See if the HADOOP_HOME environment variable is set by issuing the command echo $HADOOP_HOME on the command line. See if Hadoop is in your path by typing hadoop classpath on the command line. If it is, this command lists the paths of all the jar files used by Hadoop, which should tell you the location of the Hadoop home directory. If you installed using a.deb or.rpm package, you can look in /usr/lib/hadoop, as this is often the location where these packages install Hadoop. 5. Copy the file hadoop-vertica.jar from the directory where you unzipped the connector archive to the lib subdirectory in the Hadoop home directory. 6. Copy the Vertica JDBC driver file (vertica-jdbc-x.x.x.jar) to the lib subdirectory in the Hadoop home directory ($HADOOP_HOME/lib). 7. Edit the $HADOOP_HOME/conf/hadoop-env.sh file, and find the lines: # Extra Java CLASSPATH elements. Optional. # export HADOOP_CLASSPATH= Uncomment the export line by removing the hash character (#) and add the absolute path of the JDBC driver file you copied in the previous step. For example: HPE Vertica Analytic Database (7.2.x) Page 109 of 145
Using the MapReduce Connector export HADOOP_CLASSPATH=$HADOOP_HOME/lib/vertica-jdbc-x.x.x.jar This environment variable ensures that Hadoop can find the Vertica JDBC driver. 8. Also in the $HADOOP_HOME/conf/hadoop-env.sh file, ensure that the JAVA_HOME environment variable is set to your Java installation. 9. If you want your application written in Pig to be able to access Vertica, you need to: a. Locate the Pig home directory. Often, this directory is in the same parent directory as the Hadoop home directory. b. Copy the file named pig-vertica.jar from the directory where you unpacked the connector.zip file to the lib subdirectory in the Pig home directory. c. Copy the Vertica JDBC driver file (vertica-jdbc-x.x.x.jar) to the lib subdirectory in the Pig home directory. HPE Vertica Analytic Database (7.2.x) Page 110 of 145
Using the MapReduce Connector

Accessing Vertica Data From Hadoop

You need to follow three steps to have Hadoop fetch data from Vertica:

1. Set the Hadoop job's input format to be VerticaInputFormat.
2. Give the VerticaInputFormat class a query to be used to extract data from Vertica.
3. Create a Mapper class that accepts VerticaRecord objects as input.

The following sections explain each of these steps in greater detail.

Selecting VerticaInputFormat

The first step to reading Vertica data from within a Hadoop job is to set its input format. You usually set the input format within the run() method in your Hadoop application's class. To set up the input format, pass VerticaInputFormat.class to the job's setInputFormatClass method, as follows:

public int run(String[] args) throws Exception {
    // Set up the configuration and job objects
    Configuration conf = getConf();
    Job job = new Job(conf);

    // ... later in the code ...

    // Set the input format to retrieve data from
    // Vertica.
    job.setInputFormatClass(VerticaInputFormat.class);

Setting the input to the VerticaInputFormat class means that the map method will get VerticaRecord objects as its input.

HPE Vertica Analytic Database (7.2.x) Page 111 of 145
Using the MapReduce Connector

Setting the Query to Retrieve Data From Vertica

A Hadoop job that reads data from your Vertica database has to execute a query that selects its input data. You pass this query to your Hadoop application using the setInput method of the VerticaInputFormat class. The Vertica Connector for Hadoop Map Reduce sends this query to the Hadoop nodes, which then individually connect to Vertica nodes to run the query and get their input data.

A primary consideration for this query is how it segments the data being retrieved from Vertica. Since each node in the Hadoop cluster needs data to process, the query result needs to be segmented between the nodes.

There are three formats you can use for the query you want your Hadoop job to use when retrieving input data. Each format determines how the query's results are split across the Hadoop cluster. These formats are:

A simple, self-contained query.

A parameterized query along with explicit parameters.

A parameterized query along with a second query that retrieves the parameter values for the first query from Vertica.

The following sections explain each of these methods in detail.

Using a Simple Query to Extract Data From Vertica

The simplest format for the query that Hadoop uses to extract data from Vertica is a self-contained, hard-coded query. You pass this query in a String to the setInput method of the VerticaInputFormat class. You usually make this call in the run method of your Hadoop job class. For example, the following code retrieves the entire contents of the table named alltypes.

// Sets the query to use to get the data from the Vertica database.
// Simple query with no parameters
VerticaInputFormat.setInput(job, "SELECT * FROM alltypes ORDER BY key;");

The query you supply must have an ORDER BY clause, since the Vertica Connector for Hadoop Map Reduce uses it to figure out how to segment the query results between the Hadoop nodes. When it gets a simple query, the connector calculates limits and offsets

HPE Vertica Analytic Database (7.2.x) Page 112 of 145

to be sent to each node in the Hadoop cluster, so they each retrieve a portion of the query results to process.

Having Hadoop use a simple query to retrieve data from Vertica is the least efficient method, since the connector needs to perform extra processing to determine how the data should be segmented across the Hadoop nodes.

Using a Parameterized Query and Parameter Lists

You can have Hadoop retrieve data from Vertica using a parameterized query, to which you supply a set of parameters. The parameters in the query are represented by a question mark (?). You pass the query and the parameters to the setInput method of the VerticaInputFormat class. You have two options for passing the parameters: using a discrete list, or using a Collection object.

Using a Discrete List of Values

To pass a discrete list of parameters for the query, you include them in the setInput method call in a comma-separated list of string values, as shown in the next example:

// Simple query with supplied parameters
VerticaInputFormat.setInput(job,
    "SELECT * FROM alltypes WHERE key = ?", "1001", "1002", "1003");

The Vertica Connector for Hadoop Map Reduce tries to evenly distribute the query and parameters among the nodes in the Hadoop cluster. If the number of parameters is not a multiple of the number of nodes in the cluster, some nodes will get more parameters to process than others. Once the connector divides up the parameters among the Hadoop nodes, each node connects to a host in the Vertica database and executes the query, substituting in the parameter values it received.

This format is useful when you have a discrete set of parameters that will not change over time. However, it is inflexible because any change to the parameter list requires you to recompile your Hadoop job. An added limitation is that the query can contain just a single parameter, because the setInput method only accepts a single parameter list. The more flexible way to use parameterized queries is to use a collection to contain the parameters.

Using a Collection Object

The more flexible method of supplying the parameters for the query is to store them into a Collection object, then include the object in the setInput method call. This method

HPE Vertica Analytic Database (7.2.x) Page 113 of 145

allows you to build the list of parameters at run time, rather than having them hard-coded. You can also use multiple parameters in the query, since you pass a collection of ArrayList objects to the setInput method. Each ArrayList object supplies one set of parameter values, and can contain a value for each parameter in the query.

The following example demonstrates using a collection to pass the parameter values for a query containing two parameters. The collection object passed to setInput is an instance of the HashSet class. This object contains four ArrayList objects added within the for loop. This example just adds dummy values (the loop counter and the string "FOUR"). In your own application, you usually calculate parameter values in some manner before adding them to the collection.

Note: If your parameter values are stored in Vertica, you should specify the parameters using a query instead of a collection. See Using a Query to Retrieve Parameter Values for a Parameterized Query for details.

// Collection to hold all of the sets of parameters for the query.
Collection<List<Object>> params = new HashSet<List<Object>>();

// Each set of parameters lives in an ArrayList. Each entry
// in the list supplies a value for a single parameter in
// the query. Here, ArrayList objects are created in a loop
// that adds the loop counter and a static string as the
// parameters. The ArrayList is then added to the collection.
for (int i = 0; i < 4; i++) {
    ArrayList<Object> param = new ArrayList<Object>();
    param.add(i);
    param.add("FOUR");
    params.add(param);
}

VerticaInputFormat.setInput(job,
    "select * from alltypes where key = ? AND NOT varcharcol = ?", params);

Scaling Parameter Lists for the Hadoop Cluster

Whenever possible, make the number of parameter values you pass to the Vertica Connector for Hadoop Map Reduce equal to the number of nodes in the Hadoop cluster, because each parameter value is assigned to a single Hadoop node. This ensures that the workload is spread across the entire Hadoop cluster. If you supply fewer parameter values than there are nodes in the Hadoop cluster, some of the nodes will not get a value and will sit idle. If the number of parameter values is not a multiple of the number of nodes in the cluster, Hadoop randomly assigns the extra values to nodes in the

HPE Vertica Analytic Database (7.2.x) Page 114 of 145

cluster. It does not perform scheduling: it does not wait for a node to finish its task and become free before assigning additional tasks. In this case, a node could become a bottleneck if it is assigned the longer-running portions of the job.

In addition to supplying the right number of parameter values, you should make each parameter value yield roughly the same number of results. Ensuring that each parameter yields the same number of results helps prevent a single node in the Hadoop cluster from becoming a bottleneck by having to process more data than the other nodes in the cluster.

Using a Query to Retrieve Parameter Values for a Parameterized Query

You can pass the Vertica Connector for Hadoop Map Reduce a query to extract the parameter values for a parameterized query. This query must return a single column of data that is used as parameters for the parameterized query.

To use a query to retrieve the parameter values, supply the VerticaInputFormat class's setInput method with the parameterized query and a query to retrieve parameters. For example:

// Sets the query to use to get the data from the Vertica database.
// Query using a parameter that is supplied by another query
VerticaInputFormat.setInput(job,
    "select * from alltypes where key = ?",
    "select distinct key from regions");

When it receives a query for parameters, the connector runs the query itself, then groups the results together to send out to the Hadoop nodes, along with the parameterized query. The Hadoop nodes then run the parameterized query using the set of parameter values sent to them by the connector.

Writing a Map Class That Processes Vertica Data

Once you have set up your Hadoop application to read data from Vertica, you need to create a Map class that actually processes the data. Your Map class's map method receives LongWritable values as keys and VerticaRecord objects as values. The key values are just sequential numbers that identify the row in the query results. The VerticaRecord class represents a single row from the result set returned by the query you supplied to the VerticaInputFormat.setInput method.

HPE Vertica Analytic Database (7.2.x) Page 115 of 145
Using the MapReduce Connector

Working with the VerticaRecord Class

Your map method extracts the data it needs from the VerticaRecord class. This class contains three main methods you use to extract data from the record set:

get retrieves a single value, either by index value or by name, from the row sent to the map method.

getOrdinalPosition takes a string containing a column name and returns the column's number.

getType returns the data type of a column in the row specified by index value or by name. This method is useful if you are unsure of the data types of the columns returned by the query. The types are stored as integer values defined by the java.sql.Types class.

A short sketch that uses getOrdinalPosition and getType appears at the end of this topic.

The following example shows a Mapper class and map method that accepts VerticaRecord objects. In this example, no real work is done. Instead, two values are selected as the key and value to be passed on to the reducer.

public static class Map extends
        Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> {
    // This mapper accepts VerticaRecords as input.
    public void map(LongWritable key, VerticaRecord value, Context context)
            throws IOException, InterruptedException {
        // In your mapper, you would do actual processing here.
        // This simple example just extracts two values from the row of
        // data and passes them to the reducer as the key and value.
        if (value.get(3) != null && value.get(0) != null) {
            context.write(new Text((String) value.get(3)),
                    new DoubleWritable((Long) value.get(0)));
        }
    }
}

If your Hadoop job has a reduce stage, all of the map method output is managed by Hadoop. It is not stored or manipulated in any way by Vertica. If your Hadoop job does not have a reduce stage, and needs to store its output into Vertica, your map method must output its keys as Text objects and values as VerticaRecord objects.

HPE Vertica Analytic Database (7.2.x) Page 116 of 145
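The preceding example uses only the get method. The fragment below is a minimal sketch, not taken from the connector's shipped examples, of how getOrdinalPosition and getType can be combined to look up a column by name and confirm its SQL type before casting the value. The column names (customer_name and total), the expected java.sql.Types constant, and the exact method signatures are assumptions based on the descriptions above; adjust them to match your own query.

// Minimal sketch (hypothetical column names): look up a column's position
// and its SQL type before casting the value it holds.
public void map(LongWritable key, VerticaRecord value, Context context)
        throws IOException, InterruptedException {
    // getOrdinalPosition converts a column name into its index in the row.
    int totalPos = value.getOrdinalPosition("total");

    // getType returns a java.sql.Types constant; compare it against the
    // constant you expect before casting. DOUBLE is used here as an example.
    if (value.getType(totalPos) == java.sql.Types.DOUBLE
            && value.get(totalPos) != null) {
        context.write(new Text((String) value.get("customer_name")),
                new DoubleWritable((Double) value.get(totalPos)));
    }
}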
Using the MapReduce Connector

Writing Data to Vertica From Hadoop

There are three steps you need to take for your Hadoop application to store data in Vertica:

1. Set the output value class of your Hadoop job to VerticaRecord.
2. Set the details of the Vertica table where you want to store your data in the VerticaOutputFormat class.
3. Create a Reduce class that adds data to a VerticaRecord object and calls its write method to store the data.

The following sections explain these steps in more detail.

Configuring Hadoop to Output to Vertica

To tell your Hadoop application to output data to Vertica, you configure your Hadoop application to output to the Vertica Connector for Hadoop Map Reduce. You will normally perform these steps in your Hadoop application's run method. There are three methods that need to be called in order to set up the output to be sent to the connector and to set the output of the Reduce class, as shown in the following example:

// Set the output format of the Reduce class. It will
// output VerticaRecords that will be stored in the
// database.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VerticaRecord.class);

// Tell Hadoop to send its output to the
// Vertica Connector for Hadoop Map Reduce.
job.setOutputFormatClass(VerticaOutputFormat.class);

The call to setOutputValueClass tells Hadoop that the output of the Reduce.reduce method is a VerticaRecord class object. This object represents a single row of a Vertica database table. You tell your Hadoop job to send the data to the connector by setting the output format class to VerticaOutputFormat.

Defining the Output Table

Call the VerticaOutputFormat.setOutput method to define the table that will hold the Hadoop application output:

VerticaOutputFormat.setOutput(jobObject, tableName [, truncate [, "columnName1 dataType1" [, ..., "columnNameN dataTypeN"]]]);

HPE Vertica Analytic Database (7.2.x) Page 117 of 145

jobObject
    The Hadoop job object for your application.

tableName
    The name of the table to store Hadoop's output. If this table does not exist, the Vertica Connector for Hadoop Map Reduce automatically creates it. The name can be a full database.schema.table reference.

truncate
    A Boolean controlling whether to delete the contents of tableName if it already exists. If set to true, any existing data in the table is deleted before Hadoop's output is stored. If set to false or not given, the Hadoop output is added to the existing data in the table.

"columnName1 dataType1"
    The table column definitions, where columnName1 is the column name and dataType1 is the SQL data type. These two values are separated by a space. If not specified, the existing table is used.

The first two parameters are required. You can add as many column definitions as you need in your output table.

You usually call the setOutput method in your Hadoop class's run method, where all other setup work is done. The following example sets up an output table named mrtarget that contains 8 columns, each containing a different data type:

// Sets the output format for storing data in Vertica. It defines the
// table where data is stored, and the columns it will be stored in.
VerticaOutputFormat.setOutput(job, "mrtarget", true, "a int", "b boolean",
    "c char(1)", "d date", "f float", "t timestamp", "v varchar", "z varbinary");

If the truncate parameter is set to true for the method call, and the target table already exists in the Vertica database, the connector deletes the table contents before storing the Hadoop job's output.

Note: If the table already exists in the database, and the method call's truncate parameter is set to false, the Vertica Connector for Hadoop Map Reduce adds new application output to the existing table. However, the connector does not verify that the column definitions in the existing table match those defined in the setOutput method call. If the new application output values cannot be converted to the existing

HPE Vertica Analytic Database (7.2.x) Page 118 of 145

column values, your Hadoop application can throw casting exceptions.

Writing the Reduce Class

Once your Hadoop application is configured to output data to Vertica and has its output table defined, you need to create the Reduce class that actually formats and writes the data for storage in Vertica.

The first step your Reduce class should take is to instantiate a VerticaRecord object to hold the output of the reduce method. This is a little more complex than just instantiating a base object, since the VerticaRecord object must have the columns defined in it that match the output table's columns (see Defining the Output Table for details). To get the properly configured VerticaRecord object, you pass the constructor the configuration object.

You usually instantiate the VerticaRecord object in your Reduce class's setup method, which Hadoop calls before it calls the reduce method. For example:

// Sets up the output record that will be populated by
// the reducer and eventually written out.
public void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    try {
        // Instantiate a VerticaRecord object that has the proper
        // column definitions. The object is stored in the record
        // field for use later.
        record = new VerticaRecord(context.getConfiguration());
    } catch (Exception e) {
        throw new IOException(e);
    }
}

Storing Data in the VerticaRecord

Your reduce method starts the same way any other Hadoop reduce method does: it processes its input key and value, performing whatever reduction task your application needs. Afterwards, your reduce method adds the data to be stored in Vertica to the VerticaRecord object that was instantiated earlier. Usually you use the set method to add the data:

VerticaRecord.set(column, value);

column
    The column to store the value in. This is either an integer (the column number) or a String (the column name, as defined in the table definition).

HPE Vertica Analytic Database (7.2.x) Page 119 of 145

    Note: The set method throws an exception if you pass it the name of a column that does not exist. You should always use a try/catch block around any set method call that uses a column name.

value
    The value to store in the column. The data type of this value must match the definition of the column in the table.

    Note: If you do not have the set method validate that the data types of the value and the column match, the Vertica Connector for Hadoop Map Reduce throws a ClassCastException if it finds a mismatch when it tries to commit the data to the database. This exception causes a rollback of the entire result. By having the set method validate the data type of the value, you can catch and resolve the exception before it causes a rollback.

In addition to the set method, you can also use the setFromString method to have the Vertica Connector for Hadoop Map Reduce convert the value from String to the proper data type for the column:

VerticaRecord.setFromString(column, "value");

column
    The column number to store the value in, as an integer.

value
    A String containing the value to store in the column. If the String cannot be converted to the correct data type to be stored in the column, setFromString throws an exception (ParseException for date values, NumberFormatException for numeric values).

Your reduce method must output a value for every column in the output table. If you want a column to have a null value, you must explicitly set it.

After it populates the VerticaRecord object, your reduce method calls the Context.write method, passing it the name of the table to store the data in as the key, and the VerticaRecord object as the value.

The following example shows a simple Reduce class that stores data into Vertica. To make the example as simple as possible, the code doesn't actually process the input it receives, and instead just writes dummy data to the database. In your own application, you would process the key and values into data that you then store in the VerticaRecord object.

HPE Vertica Analytic Database (7.2.x) Page 120 of 145
Using the MapReduce Connector public static class Reduce extends Reducer<Text, DoubleWritable, Text, VerticaRecord> { // Holds the records that the reducer writes its values to. VerticaRecord record = null; // Sets up the output record that will be populated by // the reducer and eventually written out. public void setup(context context) throws IOException, InterruptedException { super.setup(context); try { // Need to call VerticaOutputFormat to get a record object // that has the proper column definitions. record = new VerticaRecord(context.getConfiguration()); } catch (Exception e) { throw new IOException(e); } } // The reduce method. public void reduce(text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException { // Ensure that the record object has been set up properly. This is // where the results are written. if (record == null) { throw new IOException("No output record found"); } // In this part of your application, your reducer would process the // key and values parameters to arrive at values that you want to // store into the database. For simplicity's sake, this example // skips all of the processing, and just inserts arbitrary values // into the database. // // Use the.set method to store a value in the record to be stored // in the database. The first parameter is the column number, // the second is the value to store. // // Column numbers start at 0. // // Set record 0 to an integer value, you // should always use a try/catch block to catch the exception. try { record.set(0, 125); } catch (Exception e) { // Handle the improper data type here. e.printstacktrace(); } // You can also set column values by name rather than by column // number. However, this requires a try/catch since specifying a // non-existent column name will throw an exception. try { // The second column, named "b", contains a Boolean value. record.set("b", true); HPE Vertica Analytic Database (7.2.x) Page 121 of 145
Using the MapReduce Connector } catch (Exception e) { // Handle an improper column name here. e.printstacktrace(); } // Column 2 stores a single char value. record.set(2, 'c'); // Column 3 is a date. Value must be a java.sql.date type. record.set(3, new java.sql.date( Calendar.getInstance().getTimeInMillis())); // You can use the setfromstring method to convert a string // value into the proper data type to be stored in the column. // You need to use a try...catch block in this case, since the // string to value conversion could fail (for example, trying to // store "Hello, World!" in a float column is not going to work). try { record.setfromstring(4, "234.567"); } catch (ParseException e) { // Thrown if the string cannot be parsed into the data type // to be stored in the column. e.printstacktrace(); } } } // Column 5 stores a timestamp record.set(5, new java.sql.timestamp( Calendar.getInstance().getTimeInMillis())); // Column 6 stores a varchar record.set(6, "example string"); // Column 7 stores a varbinary record.set(7, new byte[10]); // Once the columns are populated, write the record to store // the row into the database. context.write(new Text("mrtarget"), record); Passing Parameters to the Vertica Connector for Hadoop Map Reduce At Run Time Specifying the Location of the Connector.jar File Recent versions of Hadoop fail to find the Vertica Connector for Hadoop Map Reduce classes automatically, even though they are included in the Hadoop lib directory. Therefore, you need to manually tell Hadoop where to find the connector.jar file using the libjars argument: HPE Vertica Analytic Database (7.2.x) Page 122 of 145
Using the MapReduce Connector

hadoop jar myhadoopapp.jar com.myorg.hadoop.MyHadoopApp \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    ...

Specifying the Database Connection Parameters

You need to pass connection parameters to the Vertica Connector for Hadoop Map Reduce when starting your Hadoop application, so it knows how to connect to your database. At a minimum, these parameters must include the list of hostnames in the Vertica database cluster, the name of the database, and the user name. The common parameters for accessing the database appear in the following list. Usually, you will only need these basic parameters in order to start your Hadoop application.

mapred.vertica.hostnames
    A comma-separated list of the names or IP addresses of the hosts in the Vertica cluster. You should list all of the nodes in the cluster here, since individual nodes in the Hadoop cluster connect directly with a randomly assigned host in the cluster. The hosts in this cluster are used for both reading data from and writing data to the Vertica database, unless you specify a different output database (see below).
    Required: Yes. Default: none.

mapred.vertica.port
    The port number for the Vertica database.
    Required: No. Default: 5433.

mapred.vertica.database
    The name of the database the Hadoop application should access.
    Required: Yes.

mapred.vertica.username
    The username to use when connecting to the database.
    Required: Yes.

mapred.vertica.password
    The password to use when connecting to the database.
    Required: No. Default: empty.

HPE Vertica Analytic Database (7.2.x) Page 123 of 145

You pass the parameters to the connector using the -D command line switch in the command you use to start your Hadoop application. For example:

hadoop jar myhadoopapp.jar com.myorg.hadoop.MyHadoopApp \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.vertica.hostnames=Vertica01,Vertica02,Vertica03,Vertica04 \
    -Dmapred.vertica.port=5433 -Dmapred.vertica.username=exampleuser \
    -Dmapred.vertica.password=password123 -Dmapred.vertica.database=ExampleDB

Parameters for a Separate Output Database

The parameters in the previous list are all you need if your Hadoop application accesses a single Vertica database. You can also have your Hadoop application read from one Vertica database and write to a different Vertica database. In this case, the parameters shown in the previous list apply to the input database (the one Hadoop reads data from). The following list describes the parameters that you use to supply your Hadoop application with the connection information for the output database (the one it writes its data to). None of these parameters is required. If you do not assign a value to one of these output parameters, it inherits its value from the corresponding input database parameter.

mapred.vertica.hostnames.output
    A comma-separated list of the names or IP addresses of the hosts in the output Vertica cluster.
    Default: Input hostnames.

mapred.vertica.port.output
    The port number for the output Vertica database.
    Default: 5433.

mapred.vertica.database.output
    The name of the output database.
    Default: Input database name.

mapred.vertica.username.output
    The username to use when connecting to the output database.
    Default: Input database user name.

mapred.vertica.password.output
    The password to use when connecting to the output database.
    Default: Input database password.

HPE Vertica Analytic Database (7.2.x) Page 124 of 145
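If you prefer to set the connection properties in code rather than with -D switches, the same property names can be set on the job's Configuration object before the Job is created, typically in your application's run method. The following is a minimal sketch, not taken from the connector documentation; the host names, credentials, and database name are placeholders, and hard-coding credentials in your application is generally less flexible than passing them on the command line.

// Minimal sketch: set the connector's connection properties in code
// instead of with -D command line switches. All values are placeholders.
Configuration conf = getConf();
conf.set("mapred.vertica.hostnames", "Vertica01,Vertica02,Vertica03");
conf.set("mapred.vertica.port", "5433");
conf.set("mapred.vertica.database", "ExampleDB");
conf.set("mapred.vertica.username", "exampleuser");
conf.set("mapred.vertica.password", "password123");
Job job = new Job(conf);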
Using the MapReduce Connector Example Vertica Connector for Hadoop Map Reduce Application This section presents an example of using the Vertica Connector for Hadoop Map Reduce to retrieve and store data from a Vertica database. The example pulls together the code that has appeared on the previous topics to present a functioning example. This application reads data from a table named alltypes. The mapper selects two values from this table to send to the reducer. The reducer doesn't perform any operations on the input, and instead inserts arbitrary data into the output table named mrtarget. package com.vertica.hadoop; import java.io.ioexception; import java.util.arraylist; import java.util.calendar; import java.util.collection; import java.util.hashset; import java.util.iterator; import java.util.list; import java.math.bigdecimal; import java.sql.date; import java.sql.timestamp; // Needed when using the setfromstring method, which throws this exception. import java.text.parseexception; import org.apache.hadoop.conf.configuration; import org.apache.hadoop.conf.configured; import org.apache.hadoop.io.doublewritable; import org.apache.hadoop.io.longwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.job; import org.apache.hadoop.mapreduce.mapper; import org.apache.hadoop.mapreduce.reducer; import org.apache.hadoop.util.tool; import org.apache.hadoop.util.toolrunner; import com.vertica.hadoop.verticaconfiguration; import com.vertica.hadoop.verticainputformat; import com.vertica.hadoop.verticaoutputformat; import com.vertica.hadoop.verticarecord; // This is the class that contains the entire Hadoop example. public class VerticaExample extends Configured implements Tool { public static class Map extends Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> { // This mapper accepts VerticaRecords as input. public void map(longwritable key, VerticaRecord value, Context context) throws IOException, InterruptedException { // In your mapper, you would do actual processing here. // This simple example just extracts two values from the row of HPE Vertica Analytic Database (7.2.x) Page 125 of 145
Using the MapReduce Connector } } // data and passes them to the reducer as the key and value. if (value.get(3)!= null && value.get(0)!= null) { context.write(new Text((String) value.get(3)), new DoubleWritable((Long) value.get(0))); } public static class Reduce extends Reducer<Text, DoubleWritable, Text, VerticaRecord> { // Holds the records that the reducer writes its values to. VerticaRecord record = null; // Sets up the output record that will be populated by // the reducer and eventually written out. public void setup(context context) throws IOException, InterruptedException { super.setup(context); try { // Need to call VerticaOutputFormat to get a record object // that has the proper column definitions. record = new VerticaRecord(context.getConfiguration()); } catch (Exception e) { throw new IOException(e); } } // The reduce method. public void reduce(text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException { // Ensure that the record object has been set up properly. This is // where the results are written. if (record == null) { throw new IOException("No output record found"); } // In this part of your application, your reducer would process the // key and values parameters to arrive at values that you want to // store into the database. For simplicity's sake, this example // skips all of the processing, and just inserts arbitrary values // into the database. // // Use the.set method to store a value in the record to be stored // in the database. The first parameter is the column number, // the second is the value to store. // // Column numbers start at 0. // // Set record 0 to an integer value, you // should always use a try/catch block to catch the exception. try { record.set(0, 125); HPE Vertica Analytic Database (7.2.x) Page 126 of 145
Using the MapReduce Connector } catch (Exception e) { // Handle the improper data type here. e.printstacktrace(); } // You can also set column values by name rather than by column // number. However, this requires a try/catch since specifying a // non-existent column name will throw an exception. try { // The second column, named "b", contains a Boolean value. record.set("b", true); } catch (Exception e) { // Handle an improper column name here. e.printstacktrace(); } // Column 2 stores a single char value. record.set(2, 'c'); // Column 3 is a date. Value must be a java.sql.date type. record.set(3, new java.sql.date( Calendar.getInstance().getTimeInMillis())); // You can use the setfromstring method to convert a string // value into the proper data type to be stored in the column. // You need to use a try...catch block in this case, since the // string to value conversion could fail (for example, trying to // store "Hello, World!" in a float column is not going to work). try { record.setfromstring(4, "234.567"); } catch (ParseException e) { // Thrown if the string cannot be parsed into the data type // to be stored in the column. e.printstacktrace(); } } } // Column 5 stores a timestamp record.set(5, new java.sql.timestamp( Calendar.getInstance().getTimeInMillis())); // Column 6 stores a varchar record.set(6, "example string"); // Column 7 stores a varbinary record.set(7, new byte[10]); // Once the columns are populated, write the record to store // the row into the database. context.write(new Text("mrtarget"), record); @Override public int run(string[] args) throws Exception { // Set up the configuration and job objects Configuration conf = getconf(); Job job = new Job(conf); conf = job.getconfiguration(); conf.set("mapreduce.job.tracker", "local"); job.setjobname("vertica test"); HPE Vertica Analytic Database (7.2.x) Page 127 of 145
Using the MapReduce Connector // Set the input format to retrieve data from // Vertica. job.setinputformatclass(verticainputformat.class); // Set the output format of the mapper. This is the interim // data format passed to the reducer. Here, we will pass in a // Double. The interim data is not processed by Vertica in any // way. job.setmapoutputkeyclass(text.class); job.setmapoutputvalueclass(doublewritable.class); // Set the output format of the Hadoop application. It will // output VerticaRecords that will be stored in the // database. job.setoutputkeyclass(text.class); job.setoutputvalueclass(verticarecord.class); job.setoutputformatclass(verticaoutputformat.class); job.setjarbyclass(verticaexample.class); job.setmapperclass(map.class); job.setreducerclass(reduce.class); } // Sets the output format for storing data in Vertica. It defines the // table where data is stored, and the columns it will be stored in. VerticaOutputFormat.setOutput(job, "mrtarget", true, "a int", "b boolean", "c char(1)", "d date", "f float", "t timestamp", "v varchar", "z varbinary"); // Sets the query to use to get the data from the Vertica database. // Query using a list of parameters. VerticaInputFormat.setInput(job, "select * from alltypes where key =?", "1", "2", "3"); job.waitforcompletion(true); return 0; } public static void main(string[] args) throws Exception { int res = ToolRunner.run(new Configuration(), new VerticaExample(), args); System.exit(res); } Compiling and Running the Example Application To run the example Hadoop application, you first need to set up the alltypes table that the example reads as input. To set up the input table, save the following Perl script as MakeAllTypes.pl to a location on one of your Vertica nodes: #!/usr/bin/perl open FILE, ">datasource" or die $!; for ($i=0; $i < 10; $i++) { print FILE $i. " ". rand(10000); print FILE " one ONE 1 1999-01-08 1999-02-23 03:11:52.35"; HPE Vertica Analytic Database (7.2.x) Page 128 of 145
    print FILE '|1999-01-08 07:04:37|07:09:23|15:12:34 EST|0xabcd|';
    print FILE '0xabcd|1234532|03:03:03' . qq(\n);
}
close FILE;

Then follow these steps:
1. Connect to the node where you saved the MakeAllTypes.pl file.
2. Run the MakeAllTypes.pl file. This generates a file named datasource in the current directory.
Note: If your node does not have Perl installed, you can run this script on a system that does have Perl installed, then copy the datasource file to a database node.
3. On the same node, use vsql to connect to your Vertica database.
4. Run the following query to create the alltypes table:

CREATE TABLE alltypes (key identity,
                       intcol integer,
                       floatcol float,
                       charcol char(10),
                       varcharcol varchar,
                       boolcol boolean,
                       datecol date,
                       timestampcol timestamp,
                       timestamptzcol timestamptz,
                       timecol time,
                       timetzcol timetz,
                       varbincol varbinary,
                       bincol binary,
                       numcol numeric(38,0),
                       intervalcol interval);

5. Run the following query to load data from the datasource file into the table:

COPY alltypes COLUMN OPTION (varbincol FORMAT 'hex', bincol FORMAT 'hex')
    FROM '/path-to-datasource/datasource' DIRECT;

Replace path-to-datasource with the absolute path to the datasource file located in the same directory where you ran MakeAllTypes.pl.
Compiling the Example (optional)
The example code presented in this section is based on example code distributed along with the Vertica Connector for Hadoop Map Reduce in the file hadoop-vertica-example.jar. If you just want to run the example, skip to the next section and use the hadoop-vertica-example.jar file that came as part of the connector package rather than a version you compiled yourself.
To compile the example code listed in Example Vertica Connector for Hadoop Map Reduce Application, follow these steps:
1. Log into a node on your Hadoop cluster.
2. Locate the Hadoop home directory. See Installing the Connector for tips on how to find this directory.
3. If it is not already set, set the environment variable HADOOP_HOME to the Hadoop home directory:

export HADOOP_HOME=path_to_Hadoop_home

If you installed Hadoop using an .rpm or .deb package, Hadoop is usually installed in /usr/lib/hadoop:

export HADOOP_HOME=/usr/lib/hadoop

4. Save the example source code to a file named VerticaExample.java on your Hadoop node.
5. In the same directory where you saved VerticaExample.java, create a directory named classes. On Linux, the command is:

mkdir classes

6. Compile the Hadoop example:

javac -classpath \
$HADOOP_HOME/hadoop-core.jar:$HADOOP_HOME/lib/hadoop-vertica.jar \
-d classes VerticaExample.java \
&& jar -cvf hadoop-vertica-example.jar -C classes .
Note: If you receive errors about missing Hadoop classes, check the name of the hadoop-core.jar file. Most Hadoop installers (including Cloudera's) create a symbolic link named hadoop-core.jar to a version-specific .jar file (such as hadoop-core-0.20.203.0.jar). If your Hadoop installation did not create this link, you will have to supply the .jar file name with the version number.

When the compilation finishes, you will have a file named hadoop-vertica-example.jar in the same directory as the VerticaExample.java file. This is the file you will have Hadoop run.

Running the Example Application
Once you have compiled the example, run it using the following command line:

hadoop jar hadoop-vertica-example.jar \
    com.vertica.hadoop.VerticaExample \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,... \
    -Dmapred.vertica.port=portNumber \
    -Dmapred.vertica.username=userName \
    -Dmapred.vertica.password=dbPassword \
    -Dmapred.vertica.database=databaseName

This command tells Hadoop to run your application's .jar file, and supplies the parameters needed for your application to connect to your Vertica database. Fill in your own values for the hostnames, port, user name, password, and database name for your Vertica database.
After entering the command line, you will see output from Hadoop as it processes data that looks similar to the following:

12/01/11 10:41:19 INFO mapred.JobClient: Running job: job_201201101146_0005
12/01/11 10:41:20 INFO mapred.JobClient:  map 0% reduce 0%
12/01/11 10:41:36 INFO mapred.JobClient:  map 33% reduce 0%
12/01/11 10:41:39 INFO mapred.JobClient:  map 66% reduce 0%
12/01/11 10:41:42 INFO mapred.JobClient:  map 100% reduce 0%
12/01/11 10:41:45 INFO mapred.JobClient:  map 100% reduce 22%
12/01/11 10:41:51 INFO mapred.JobClient:  map 100% reduce 100%
12/01/11 10:41:56 INFO mapred.JobClient: Job complete: job_201201101146_0005
12/01/11 10:41:56 INFO mapred.JobClient: Counters: 23
12/01/11 10:41:56 INFO mapred.JobClient:   Job Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Launched reduce tasks=1
12/01/11 10:41:56 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=21545
12/01/11 10:41:56 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/11 10:41:56 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/01/11 10:41:56 INFO mapred.JobClient:     Launched map tasks=3
12/01/11 10:41:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13851
12/01/11 10:41:56 INFO mapred.JobClient:   File Output Format Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Bytes Written=0
12/01/11 10:41:56 INFO mapred.JobClient:   FileSystemCounters
12/01/11 10:41:56 INFO mapred.JobClient:     FILE_BYTES_READ=69
12/01/11 10:41:56 INFO mapred.JobClient:     HDFS_BYTES_READ=318
12/01/11 10:41:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=89367
12/01/11 10:41:56 INFO mapred.JobClient:   File Input Format Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Bytes Read=0
12/01/11 10:41:56 INFO mapred.JobClient:   Map-Reduce Framework
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce input groups=1
12/01/11 10:41:56 INFO mapred.JobClient:     Map output materialized bytes=81
12/01/11 10:41:56 INFO mapred.JobClient:     Combine output records=0
12/01/11 10:41:56 INFO mapred.JobClient:     Map input records=3
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce shuffle bytes=54
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce output records=1
12/01/11 10:41:56 INFO mapred.JobClient:     Spilled Records=6
12/01/11 10:41:56 INFO mapred.JobClient:     Map output bytes=57
12/01/11 10:41:56 INFO mapred.JobClient:     Combine input records=0
12/01/11 10:41:56 INFO mapred.JobClient:     Map output records=3
12/01/11 10:41:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=318
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce input records=3

Note: The version of the example supplied in the Hadoop Connector download package will produce more output, since it runs several input queries.

Verifying the Results
Once your Hadoop application finishes, you can verify it ran correctly by looking at the mrtarget table in your Vertica database.
Connect to your Vertica database using vsql and run the following query:

=> SELECT * FROM mrtarget;

The results should look like this:

  a  | b | c |     d      |    f    |            t            |       v        |                    z
-----+---+---+------------+---------+-------------------------+----------------+------------------------------------------
 125 | t | c | 2012-01-11 | 234.567 | 2012-01-11 10:41:48.837 | example string | \000\000\000\000\000\000\000\000\000\000
(1 row)
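You can also check the results programmatically. The following is a minimal sketch (not part of the connector example) that runs the same query through the Vertica JDBC driver. It assumes the JDBC driver .jar (for example, vertica-jdbc-7.2.x.jar) is on the classpath, and the host, database, user name, and password are placeholders copied from the earlier examples; substitute your own values.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerifyMrTarget {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- replace with your own host,
        // database name, user name, and password.
        String url = "jdbc:vertica://VerticaHost01:5433/ExampleDB";
        try (Connection conn = DriverManager.getConnection(url, "ExampleUser", "password123");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM mrtarget")) {
            while (rs.next()) {
                // Print the first few columns of each row the job wrote.
                System.out.printf("a=%d b=%s c=%s%n",
                        rs.getLong("a"), rs.getBoolean("b"), rs.getString("c"));
            }
        }
    }
}

Run it the same way as any other Java program, with both the Vertica JDBC driver and the compiled class on the classpath.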
Using Hadoop Streaming with the Vertica Connector for Hadoop Map Reduce
Hadoop streaming allows you to create an ad-hoc Hadoop job that uses standard commands (such as UNIX command-line utilities) for its map and reduce processing. When using streaming, Hadoop executes the command you pass to it as a mapper and breaks each line from its standard output into key and value pairs. By default, the key and value are separated by the first tab character in the line. These values are then passed to the standard input of the command that you specified as the reducer. See the Hadoop wiki's topic on streaming for more information.
You can have a streaming job retrieve data from a Vertica database, store data into a Vertica database, or both.

Reading Data From Vertica in a Streaming Hadoop Job
To have a streaming Hadoop job read data from a Vertica database, you set the inputformat argument of the Hadoop command line to com.vertica.hadoop.deprecated.VerticaStreamingInput. You also need to supply parameters that tell the Hadoop job how to connect to your Vertica database. See Passing Parameters to the Vertica Connector for Hadoop Map Reduce At Run Time for an explanation of these command-line parameters.
Note: The VerticaStreamingInput class is within the deprecated namespace because the current version of Hadoop (as of 0.20.1) has not defined a current API for streaming. Instead, the streaming classes conform to the Hadoop version 0.18 API.
In addition to the standard command-line parameters that tell Hadoop how to access your database, there are additional streaming-specific parameters you need to use. These supply Hadoop with the query it should use to extract data from Vertica, along with other query-related options.

mapred.vertica.input.query
    Description: The query to use to retrieve data from the Vertica database. See Setting the Query to Retrieve Data from Vertica for more information.
    Required: Yes
    Default: none

mapred.vertica.input.paramquery
    Description: A query to execute to retrieve parameters for the query given in the .input.query parameter.
    Required: Only if the query has parameters and no discrete parameters are supplied.
    Default: none

mapred.vertica.query.params
    Description: A discrete list of parameters for the query.
    Required: Only if the query has parameters and no parameter query is supplied.
    Default: none

mapred.vertica.input.delimiter
    Description: The character to use for separating column values. The command you use as a mapper needs to split individual column values apart using this delimiter.
    Required: No
    Default: 0xa

mapred.vertica.input.terminator
    Description: The character used to signal the end of a row of data from the query result.
    Required: No
    Default: 0xb

The following command demonstrates reading data from a table named alltypes. This command uses the UNIX cat command as both mapper and reducer, so it just passes the contents through.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,... \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.input.query="SELECT key, intcol, floatcol, varcharcol FROM alltypes ORDER BY key" \
    -Dmapred.vertica.input.delimiter=, \
    -Dmapred.map.tasks=1 \
    -inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput \
    -input /tmp/input -output /tmp/output -reducer /bin/cat -mapper /bin/cat

The results of this command are saved in the /tmp/output directory on your HDFS filesystem. On a four-node Hadoop cluster, the results would be:

# $HADOOP_HOME/bin/hadoop dfs -ls /tmp/output
Found 5 items
drwxr-xr-x   - release supergroup          0 2012-01-19 11:47 /tmp/output/_logs
-rw-r--r--   3 release supergroup         88 2012-01-19 11:47 /tmp/output/part-00000
-rw-r--r--   3 release supergroup         58 2012-01-19 11:47 /tmp/output/part-00001
-rw-r--r--   3 release supergroup         58 2012-01-19 11:47 /tmp/output/part-00002
-rw-r--r--   3 release supergroup         87 2012-01-19 11:47 /tmp/output/part-00003
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00000
1       2,1,3165.75558015273,ONE,
5       6,5,1765.76024139635,ONE,
9       10,9,4142.54176256463,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00001
2       3,2,8257.77313710329,ONE,
6       7,6,7267.69718012601,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00002
3       4,3,443.188765520475,ONE,
7       8,7,4729.27825566408,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00003
0       1,0,2456.83076632307,ONE,
4       5,4,9692.61214265391,ONE,
8       9,8,3327.25019418294,ONE,

Notes
- Even though the input is coming from Vertica, you need to supply the -input parameter to Hadoop for it to process the streaming job.
- The -Dmapred.map.tasks=1 parameter prevents multiple Hadoop nodes from reading the same data from the database, which would result in Hadoop processing multiple copies of the data.
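The /bin/cat mapper above simply passes each row through unchanged. If your job needs to do per-row work, the command you supply as the mapper must split each line it reads on the first tab character (which separates the streaming key from the value) and then split the remaining value on the character given in mapred.vertica.input.delimiter. The following is a minimal sketch of such a mapper written in Java, not part of the connector package; the class name ColumnSplitMapper is hypothetical, and it assumes the comma delimiter used in the example above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for /bin/cat in the -mapper argument.
public class ColumnSplitMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            // Hadoop streaming sends "key<TAB>value"; the value holds the
            // columns produced by VerticaStreamingInput, separated by the
            // mapred.vertica.input.delimiter character (',' in the example).
            int tab = line.indexOf('\t');
            String key = tab >= 0 ? line.substring(0, tab) : line;
            String value = tab >= 0 ? line.substring(tab + 1) : "";
            String[] columns = value.split(",", -1);

            // Do whatever per-row work you need here; this sketch just
            // re-emits the key and the first two columns.
            String out = columns.length >= 2 ? columns[0] + "," + columns[1] : value;
            System.out.println(key + "\t" + out);
        }
    }
}

You would substitute this command for /bin/cat in the command line (for example, -mapper "java ColumnSplitMapper"), making sure the compiled class is available on every task node.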
Writing Data to Vertica in a Streaming Hadoop Job
Similar to reading data from Vertica in a streaming Hadoop job, you write data to Vertica by setting the outputformat parameter of your Hadoop command to com.vertica.hadoop.deprecated.VerticaStreamingOutput. This class requires key/value pairs, but the keys are ignored. The values passed to VerticaStreamingOutput are broken into rows and inserted into a target table. Because the keys are ignored, you can use them to partition the data for the reduce phase without affecting the data transferred to Vertica.
As with reading from Vertica, you need to supply parameters that tell the streaming Hadoop job how to connect to the database. See Passing Parameters to the Vertica Connector for Hadoop Map Reduce At Run Time for an explanation of these command-line parameters. If you are reading data from one Vertica database and writing to another, you need to use the output parameters, just as you would if you were reading from and writing to separate databases using a Hadoop application. There are also additional parameters that configure the output of the streaming Hadoop job, listed below.

mapred.vertica.output.table.name
    Description: The name of the table where Hadoop should store its data.
    Required: Yes
    Default: none

mapred.vertica.output.table.def
    Description: The definition of the table. The format is the same as used for defining the output table for a Hadoop application. See Defining the Output Table for details.
    Required: Only if the table does not already exist in the database.
    Default: none

mapred.vertica.output.table.drop
    Description: Whether to truncate the table before adding data to it.
    Required: No
    Default: false

mapred.vertica.output.delimiter
    Description: The character to use for separating column values.
    Required: No
    Default: 0x7 (ASCII bell character)

mapred.vertica.output.terminator
    Description: The character used to signal the end of a row of data.
    Required: No
    Default: 0x8 (ASCII backspace character)

The following example demonstrates reading two columns of data from a Vertica database table named alltypes and writing it back to the same database in a table named hadoopout. The command provides the definition for the table, so you do not have to manually create the table beforehand.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.vertica.output.table.name=hadoopout \
    -Dmapred.vertica.output.table.def="intcol integer, varcharcol varchar" \
    -Dmapred.vertica.output.table.drop=true \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.input.query="SELECT intcol, varcharcol FROM alltypes ORDER BY key" \
    -Dmapred.vertica.input.delimiter=, \
    -Dmapred.vertica.output.delimiter=, \
    -Dmapred.vertica.input.terminator=0x0a \
    -Dmapred.vertica.output.terminator=0x0a \
    -inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput \
    -outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput \
    -input /tmp/input \
    -output /tmp/output \
    -reducer /bin/cat \
    -mapper /bin/cat

After running this command, you can view the result by querying your database:

=> SELECT * FROM hadoopout;
 intcol | varcharcol
--------+------------
      1 | ONE
      5 | ONE
      9 | ONE
      2 | ONE
      6 | ONE
      0 | ONE
      4 | ONE
      8 | ONE
      3 | ONE
      7 | ONE
(10 rows)

Loading a Text File From HDFS into Vertica
One common task when working with Hadoop and Vertica is loading text files from the Hadoop Distributed File System (HDFS) into a Vertica table. You can load these files
using Hadoop streaming, saving yourself the trouble of having to write custom map and reduce classes.
Note: Hadoop streaming is less efficient than a Java map/reduce Hadoop job, because it passes data through several different interfaces. Streaming is best used for smaller, one-time loads. If you need to load large amounts of data on a regular basis, you should create a standard Hadoop map/reduce job in Java or a script in Pig.
For example, suppose the text file in HDFS that you want to load contains values delimited by pipe characters (|), with each line of the file terminated by a carriage return:

# $HADOOP_HOME/bin/hadoop dfs -cat /tmp/textdata.txt
1|1.0|ONE
2|2.0|TWO
3|3.0|THREE

In this case, the line delimiter poses a problem. You can easily include the column delimiter in the Hadoop command line arguments. However, it is hard to specify a carriage return in the Hadoop command line. To get around this issue, you can write a mapper script to strip the carriage return and replace it with some other character that is easy to enter in the command line and does not occur in the data.
Below is an example of a mapper script written in Python. It performs two tasks:
- Strips the carriage returns from the input text and terminates each line with a tilde (~).
- Adds a key value (the string "streaming") followed by a tab character at the start of each line of the text file. The mapper script needs to do this because the streaming job that reads text files skips the reducer stage. The reducer isn't necessary, because all of the data being read from the text file should be stored in the Vertica table. However, the VerticaStreamingOutput class requires key/value pairs, so the mapper script adds the key.

#!/usr/bin/python
import sys
for line in sys.stdin.readlines():
    # Get rid of carriage returns.
    # CR is used as the record terminator by Streaming.jar
    line = line.strip()
    # Add a key. The key value can be anything.
    # The convention is to use the name of the
    # target table, as shown here.
    sys.stdout.write("streaming\t%s~\n" % line)
The Hadoop command to stream text files from HDFS into Vertica using the above mapper script appears below.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.vertica.output.table.name=streaming \
    -Dmapred.vertica.output.table.def="intcol integer, floatcol float, varcharcol varchar" \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.output.delimiter="|" \
    -Dmapred.vertica.output.terminator="~" \
    -input /tmp/textdata.txt \
    -output output \
    -mapper "python path-to-script/mapper.py" \
    -outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput

Notes
- The -Dmapred.reduce.tasks=0 parameter disables the streaming job's reducer stage. No reducer is needed, because the mapper script processes the data into the format that the VerticaStreamingOutput class expects.
- Even though the VerticaStreamingOutput class is handling the output from the mapper, you need to supply a valid output directory to the Hadoop command.

The result of running the command is a new table in the Vertica database:

=> SELECT * FROM streaming;
 intcol | floatcol | varcharcol
--------+----------+------------
      3 |        3 | THREE
      1 |        1 | ONE
      2 |        2 | TWO
(3 rows)

Accessing Vertica From Pig
The Vertica Connector for Hadoop Map Reduce includes a Java package that lets you access a Vertica database using Pig. You must copy this .jar to somewhere in your Pig installation's CLASSPATH (see Installing the Connector for details).
Registering the Vertica .jar Files
Before it can access Vertica, your Pig Latin script must register the Vertica-related .jar files. All of your Pig scripts should start with the following commands:

REGISTER 'path-to-pig-home/lib/vertica-jdbc-7.2.x.jar';
REGISTER 'path-to-pig-home/lib/pig-vertica.jar';

These commands ensure that Pig can locate the Vertica JDBC classes, as well as the interface for the connector.

Reading Data From Vertica
To read data from a Vertica database, you tell Pig Latin's LOAD statement to use a SQL query and to use the VerticaLoader class as the load function. Your query can be hard coded, or it can contain a parameter. See Setting the Query to Retrieve Data from Vertica for details.
Note: You can only use a discrete parameter list or supply a query to retrieve parameter values; you cannot use a collection to supply parameter values as you can from within a Hadoop application.
The format for calling the VerticaLoader is:

com.vertica.pig.VerticaLoader('hosts','database','port','username','password');

hosts       A comma-separated list of the hosts in the Vertica cluster.
database    The name of the database to be queried.
port        The port number for the database.
username    The username to use when connecting to the database.
password    The password to use when connecting to the database. This is the only optional parameter. If it is not present, the connector assumes the password is empty.

The following Pig Latin command extracts all of the data from the table named alltypes using a simple query:

A = LOAD 'sql://{SELECT * FROM alltypes ORDER BY key}' USING
    com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03',
    'ExampleDB','5433','ExampleUser','password123');

This example uses a parameter and supplies a discrete list of parameter values:

A = LOAD 'sql://{SELECT * FROM alltypes WHERE key = ?};{1,2,3}' USING
    com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03',
    'ExampleDB','5433','ExampleUser','password123');

This final example demonstrates using a second query to retrieve parameters from the Vertica database.

A = LOAD 'sql://{SELECT * FROM alltypes WHERE key = ?};sql://{SELECT DISTINCT key FROM alltypes}'
    USING com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03','ExampleDB',
    '5433','ExampleUser','password123');

Writing Data to Vertica
To write data to a Vertica database, you tell Pig Latin's STORE statement to save data to a database table (optionally giving the definition of the table) and to use the VerticaStorer class as the save function. If the table you specify as the destination does not exist, and you supplied the table definition, the table is automatically created in your Vertica database and the data from the relation is loaded into it.
The syntax for calling the VerticaStorer is the same as calling VerticaLoader:

com.vertica.pig.VerticaStorer('hosts','database','port','username','password');

The following example demonstrates saving a relation into a table named hadoopout, which must already exist in the database:

STORE A INTO '{hadoopout}' USING
    com.vertica.pig.VerticaStorer('Vertica01,Vertica02,Vertica03','ExampleDB','5433',
    'ExampleUser','password123');

This example shows how you can add a table definition to the table name, so that the table is created in Vertica if it does not already exist:

STORE A INTO '{outtable(a int, b int, c float, d char(10), e varchar, f boolean, g date,
    h timestamp, i timestamptz, j time, k timetz, l varbinary, m binary,
    n numeric(38,0), o interval)}' USING
    com.vertica.pig.VerticaStorer('Vertica01,Vertica02,Vertica03','ExampleDB','5433',
    'ExampleUser','password123');

Note: If the table already exists in the database, and the definition that you supply differs from the table's definition, the table is not dropped and recreated. This may
cause data type errors when data is being loaded.
Integrating Vertica with the MapR Distribution of Hadoop
MapR is a distribution of Apache Hadoop produced by MapR Technologies that extends the standard Hadoop components with its own features. By adding Vertica to a MapR cluster, you can benefit from the advantages of both Vertica and Hadoop. To learn more about integrating Vertica and MapR, see Configuring HPE Vertica Analytics Platform with MapR, which appears on the MapR website.
Send Documentation Feedback
If you have comments about this document, you can contact the documentation team by email. If an email client is configured on this system, click the link above and an email window opens with the following information in the subject line:
Feedback on Integrating with Hadoop (Vertica Analytic Database 7.2.x)
Just add your feedback to the email and click send.
If no email client is available, copy the information above to a new message in a web mail client, and send your feedback to vertica-docfeedback@hpe.com.
We appreciate your feedback!