Integrating with Apache Hadoop
HPE Vertica Analytic Database
Software Version: 7.2.x


Document Release Date: 12/7/2015

Legal Notices

Warranty

The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HPE shall not be liable for technical or editorial errors or omissions contained herein. The information contained herein is subject to change without notice.

Restricted Rights Legend

Confidential computer software. Valid license from HPE required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license.

Copyright Notice

Copyright 2015 Hewlett Packard Enterprise Development LP

Trademark Notices

Adobe is a trademark of Adobe Systems Incorporated.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.
This product includes an interface of the 'zlib' general purpose compression library, which is Copyright Jean-loup Gailly and Mark Adler.

Contents

Introduction to Hadoop Integration
    Hadoop Distributions
    Integration Options
Cluster Layout
    Co-Located Clusters
    Hardware Recommendations
    Configuring Hadoop for Co-Located Clusters
    webhdfs
    YARN
    Hadoop Balancer
    Replication Factor
    Disk Space for Non-HDFS Use
    Separate Clusters
Choosing Which Hadoop Interface to Use
    Creating an HDFS Storage Location
    Using the ORC Reader
    Using the HCatalog Connector
    Using the HDFS Connector
    Using the MapReduce Connector
Using Kerberos with Hadoop
    How Vertica uses Kerberos With Hadoop
    User Authentication
    Vertica Authentication
    See Also
    Configuring Kerberos
    Prerequisite: Setting Up Users and the Keytab File
    HCatalog Connector
    HDFS Connector
    HDFS Storage Location
    Token Expiration
    See Also
Using the ORC Reader
    Syntax
    Supported Data Types
    Timestamps
    Kerberos
    Query Performance
    Examples
Using the HCatalog Connector
    Hive, HCatalog, and WebHCat Overview
    HCatalog Connection Features
    HCatalog Connection Considerations
    How the HCatalog Connector Works
    HCatalog Connector Requirements
    Vertica Requirements
    Hadoop Requirements
    Testing Connectivity
    Installing the Java Runtime on Your Vertica Cluster
    Installing a Java Runtime
    Setting the JavaBinaryForUDx Configuration Parameter
    Configuring Vertica for HCatalog
    Copy Hadoop Libraries and Configuration Files
    Install the HCatalog Connector
    Using the HCatalog Connector with HA NameNode
    Defining a Schema Using the HCatalog Connector
    Querying Hive Tables Using HCatalog Connector
    Viewing Hive Schema and Table Metadata
    Synching an HCatalog Schema With a Local Schema
    Data Type Conversions from Hive to Vertica
    Data-Width Handling Differences Between Hive and Vertica
    Using Non-Standard SerDes
    Determining Which SerDe You Need
    Installing the SerDe on the Vertica Cluster
    Troubleshooting HCatalog Connector Problems
    Connection Errors
    UDx Failure When Querying Data
    SerDe Errors
    Differing Results Between Hive and Vertica Queries
    Preventing Excessive Query Delays
Using the HDFS Connector
    HDFS Connector Requirements
    webhdfs Requirements
    Kerberos Authentication Requirements
    Testing Your Hadoop webhdfs Configuration
    Loading Data Using the HDFS Connector
    The HDFS File URL
    Copying Files in Parallel
    Viewing Rejected Rows and Exceptions
    Creating an External Table Based on HDFS Files
    Load Errors in External Tables
    HDFS Connector Troubleshooting Tips
    User Unable to Connect to Kerberos-Authenticated Hadoop Cluster
    Resolving Transfer Rate Errors
Using HDFS Storage Locations
    Storage Location for HDFS Requirements
    HDFS Space Requirements
    Additional Requirements for Backing Up Data Stored on HDFS
    How the HDFS Storage Location Stores Data
    What You Can Store on HDFS
    What HDFS Storage Locations Cannot Do
    Creating an HDFS Storage Location
    Creating a Storage Location Using Vertica for SQL on Hadoop
    Adding HDFS Storage Locations to New Nodes
    Creating a Storage Policy for HDFS Storage Locations
    Storing an Entire Table in an HDFS Storage Location
    Storing Table Partitions in HDFS
    Moving Partitions to a Table Stored on HDFS
    Backing Up Vertica Storage Locations for HDFS
    Configuring Vertica to Restore HDFS Storage Locations
    Configuration Overview
    Installing a Java Runtime
    Finding Your Hadoop Distribution's Package Repository
    Configuring Vertica Nodes to Access the Hadoop Distribution's Package Repository
    Installing the Required Hadoop Packages
    Setting Configuration Parameters
    Setting Kerberos Parameters
    Confirming that distcp Runs
    Troubleshooting
    Configuring Hadoop and Vertica to Enable Backup of HDFS Storage
    Granting Superuser Status on Hortonworks
    Granting Superuser Status on Cloudera
    Manually Enabling Snapshotting for a Directory
    Additional Requirements for Kerberos
    Testing the Database Account's Ability to Make HDFS Directories Snapshottable
    Performing Backups Containing HDFS Storage Locations
    Removing HDFS Storage Locations
    Removing Existing Data from an HDFS Storage Location
    Moving Data to Another Storage Location
    Clearing Storage Policies
    Changing the Usage of HDFS Storage Locations
    Dropping an HDFS Storage Location
    Removing Storage Location Files from HDFS
    Removing Backup Snapshots
    Removing the Storage Location Directories
    Troubleshooting HDFS Storage Locations
    HDFS Storage Disk Consumption
    Kerberos Authentication When Creating a Storage Location
    Backup or Restore Fails When Using Kerberos
Using the MapReduce Connector
    Vertica Connector for Hadoop Features
    Prerequisites
    Hadoop and Vertica Cluster Scaling
    Installing the Connector
    Accessing Vertica Data From Hadoop
    Selecting VerticaInputFormat
    Setting the Query to Retrieve Data From Vertica
    Using a Simple Query to Extract Data From Vertica
    Using a Parameterized Query and Parameter Lists
    Using a Discrete List of Values
    Using a Collection Object
    Scaling Parameter Lists for the Hadoop Cluster
    Using a Query to Retrieve Parameter Values for a Parameterized Query
    Writing a Map Class That Processes Vertica Data
    Working with the VerticaRecord Class
    Writing Data to Vertica From Hadoop
    Configuring Hadoop to Output to Vertica
    Defining the Output Table
    Writing the Reduce Class
    Storing Data in the VerticaRecord
    Passing Parameters to the Vertica Connector for Hadoop Map Reduce At Run Time
    Specifying the Location of the Connector.jar File
    Specifying the Database Connection Parameters
    Parameters for a Separate Output Database
    Example Vertica Connector for Hadoop Map Reduce Application
    Compiling and Running the Example Application
    Compiling the Example (optional)
    Running the Example Application
    Verifying the Results
    Using Hadoop Streaming with the Vertica Connector for Hadoop Map Reduce
    Reading Data From Vertica in a Streaming Hadoop Job
    Writing Data to Vertica in a Streaming Hadoop Job
    Loading a Text File From HDFS into Vertica
    Accessing Vertica From Pig
    Registering the Vertica.jar Files
    Reading Data From Vertica
    Writing Data to Vertica
Integrating Vertica with the MapR Distribution of Hadoop
Send Documentation Feedback

Introduction to Hadoop Integration

Apache Hadoop, like Vertica, uses a cluster of nodes for distributed processing. The primary component of interest is HDFS, the Hadoop Distributed File System. You can use HDFS from Vertica in several ways:

- You can import HDFS data into locally stored ROS files.
- You can access HDFS data in place, using external tables.
- You can use HDFS as a storage location for ROS files.

Hadoop includes two other components of interest:

- Hive, a data warehouse that provides the ability to query data stored in Hadoop.
- HCatalog, a component that makes Hive metadata available to applications, such as Vertica, outside of Hadoop.

A Hadoop cluster can use Kerberos authentication to protect data stored in HDFS. Vertica integrates with Kerberos to access HDFS data if needed. See Using Kerberos with Hadoop.

Hadoop Distributions

Vertica can be used with Hadoop distributions from Hortonworks, Cloudera, and MapR. See Vertica Integrations for Hadoop for the specific versions that are supported.

Integration Options

Vertica supports two cluster architectures, and which one you use affects the decisions you make about integration:

- You can co-locate Vertica on some or all of your Hadoop nodes. Vertica can then take advantage of local data. This option is supported only for Vertica for SQL on Hadoop.
- You can build a Vertica cluster that is separate from your Hadoop cluster. In this configuration, Vertica can fully use each of its nodes; it does not share resources with Hadoop. This option is not supported for Vertica for SQL on Hadoop.

These layout options are described in Cluster Layout. Both layouts support several interfaces for using Hadoop:

- An HDFS Storage Location uses HDFS to hold Vertica data (ROS files).
- The HCatalog Connector lets Vertica query data that is stored in a Hive database the same way you query data stored natively in a Vertica schema.
- The ORC Reader lets Vertica query data that is stored in the ORC format native to Hadoop. This is faster than using the HCatalog Connector for this type of data.
- The HDFS Connector lets Vertica import HDFS data. It also lets Vertica read HDFS data as an external table without using Hive.
- The MapReduce Connector lets you create Hadoop MapReduce jobs that retrieve data from Vertica. These jobs can also insert data into Vertica.
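Several of these interfaces come down to ordinary SQL statements. The following sketch shows the in-place pattern used by the ORC Reader; the table, column, host, and path names are hypothetical, and the other interfaces are covered in the chapters listed above.

=> -- Define an external table over ORC files in HDFS (hypothetical names);
=> -- the webhdfs URL form and ON ANY NODE are described in Using the ORC Reader
=> CREATE EXTERNAL TABLE web_logs (visit_time TIMESTAMP, url VARCHAR(500))
       AS COPY FROM 'webhdfs://namenode.example.com:50070/data/logs/*.orc' ON ANY NODE ORC;
=> SELECT COUNT(*) FROM web_logs;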

Cluster Layout

Vertica and Hadoop each use a cluster of nodes for distributed processing. These clusters can be co-located, meaning you run both products on the same machines, or separate. Co-located clusters are for use with Vertica for SQL on Hadoop licenses. Separate clusters are for use with Premium Edition and Community Edition licenses.

Co-Located Clusters

With co-located clusters, Vertica is installed on some or all of your Hadoop nodes. The Vertica nodes use a private network in addition to the public network used by all Hadoop nodes, as the following figure shows:

[Figure: co-located Vertica and Hadoop clusters sharing nodes, with a private Vertica network alongside the public Hadoop network]

You might choose to place Vertica on all of your Hadoop nodes or only on some of them. If you are using HDFS storage locations, you should use at least three Vertica nodes, the minimum number for K-safety. Using more Vertica nodes can improve performance because the HDFS data needed by a query is more likely to be local.

Normally, both Hadoop and Vertica use the entire node. Because this configuration uses shared nodes, you must address potential resource contention in your configuration on those nodes. See Configuring Hadoop for Co-Located Clusters for more information. No changes are needed on Hadoop-only nodes.

You can place Hadoop and Vertica clusters within a single rack, or you can span across many racks and nodes. Spreading node types across racks can improve efficiency.

Hardware Recommendations

Hadoop clusters frequently do not have identical provisioning requirements or hardware configurations. However, Vertica nodes should be equivalent in size and capability, per the best-practice standards recommended in General Hardware and OS Requirements and Recommendations in Installing Vertica. Because Hadoop cluster specifications do not always meet these standards, Hewlett Packard Enterprise recommends the following specifications for Vertica nodes in your Hadoop cluster.

Processor: For best performance, run two-socket servers with 8 to 14 core CPUs, clocked at or above 2.6 GHz, for clusters over 10 TB, or single-socket servers with 8 to 12 cores clocked at or above 2.6 GHz for clusters under 10 TB.

Memory: Distribute the memory appropriately across all memory channels in the server. Provision a minimum of 8 GB of memory per physical CPU core in the server; high-performance applications need more memory per physical core. Use at least DDR3-1600 memory; faster memory is preferred.

Storage: Plan for read/write throughput of at least 40 MB/s per physical core of the CPU; higher throughput per core gives better performance. Post-RAID, each node should have 1 to 9 TB of storage. For a production setting, RAID 10 is recommended. In some cases, RAID 50 is acceptable. Because of the heavy compression and encoding that Vertica does, SSDs are not required. In most cases, a RAID of more, less-expensive HDDs performs just as well as a RAID of fewer SSDs. If you intend to use RAID 50 for your data partition, you should keep a spare node in every rack, allowing for manual failover of a Vertica node in the case of a drive failure. A Vertica node recovery is faster than a RAID 50 rebuild. Also, be sure to never put more than 10 TB compressed on any node, to keep node recovery times at an acceptable rate.

Network: Use 10 Gb networking in almost every case. With the introduction of 10 Gb over Cat6a (Ethernet), the cost difference is minimal.

Configuring Hadoop for Co-Located Clusters

If you are co-locating Vertica on any HDFS nodes, there are some additional configuration requirements.

webhdfs

Hadoop has two services that can provide web access to HDFS: webhdfs and httpfs. For Vertica, you must use the webhdfs service.

YARN

The YARN service is available in newer releases of Hadoop. It performs resource management for Hadoop clusters. When co-locating Vertica on YARN-managed Hadoop nodes, you must make some changes in YARN.

HPE recommends reserving at least 16 GB of memory for Vertica on shared nodes. Reserving more will improve performance. How you do this depends on your Hadoop distribution:

- If you are using Hortonworks, create a "Vertica" node label and assign it to the nodes that are running Vertica.
- If you are using Cloudera, enable and configure static service pools.

Consult the documentation for your Hadoop distribution for details. Alternatively, you can disable YARN on the shared nodes.

Hadoop Balancer

The Hadoop Balancer can redistribute data blocks across HDFS. For many Hadoop services, this feature is useful. However, for Vertica this can reduce performance under some conditions.

If you are using HDFS storage locations, the Hadoop Balancer can move data away from the Vertica nodes that are operating on it, which degrades performance. This can also occur when reading ORC files if Vertica is not running on all Hadoop nodes. (If you are using separate Vertica and Hadoop clusters, all Hadoop access is over the network, and the performance cost is less noticeable.)

To prevent the undesired movement of data blocks across the HDFS cluster, consider excluding Vertica nodes from rebalancing. See the Hadoop documentation to learn how to do this.

Replication Factor

By default, HDFS stores three copies of each data block. Vertica is generally set up to store two copies of each data item through K-safety. Thus, lowering the replication factor to 2 can save space and still provide data protection. To lower the number of copies HDFS stores, set the HadoopFSReplication configuration parameter, as explained in Troubleshooting HDFS Storage Locations; an example follows at the end of this section.

Disk Space for Non-HDFS Use

You also need to reserve some disk space for non-HDFS use. To reserve disk space using Ambari, set dfs.datanode.du.reserved to a value in the hdfs-site.xml configuration file. Setting this parameter preserves space for the non-HDFS files that Vertica requires.
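The following minimal sketch lowers the replication factor that Vertica requests for data it writes to HDFS storage locations. The database name is hypothetical; the parameter is set with ALTER DATABASE, the same way other configuration parameters are set in this guide.

=> -- Ask HDFS to keep two copies of blocks written by Vertica (mydb is hypothetical)
=> ALTER DATABASE mydb SET HadoopFSReplication = 2;

As noted above, this saves space while still providing data protection when the database is K-safe.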

Separate Clusters

In the Premium Edition product, your Vertica and Hadoop clusters must be set up on separate nodes, ideally connected by a high-bandwidth network connection. This is different from the configuration for Vertica for SQL on Hadoop, in which Vertica nodes are co-located on Hadoop nodes. The following figure illustrates the configuration for separate clusters:

[Figure: separate Vertica and Hadoop clusters connected by a shared network]

The network is a key performance component of any well-configured cluster. When Vertica stores data to HDFS, it writes and reads data across the network. The layout shown in the figure calls for two networks, and there are benefits to adding a third:

- Database Private Network: Vertica uses a private network for command and control and for moving data between nodes in support of its database functions. In some networks, command and control and the passing of data are split across two networks.
- Database/Hadoop Shared Network: Each Vertica node must be able to connect to each Hadoop data node and the NameNode. Hadoop best practices generally require a dedicated network for the Hadoop cluster. This is not a technical requirement, but a dedicated network improves Hadoop performance. Vertica and Hadoop should share the dedicated Hadoop network.
- Optional Client Network: Outside clients may access the clustered networks through a client network. This is not an absolute requirement, but the use of a third network that supports client connections to either Vertica or Hadoop can improve performance. If the configuration does not support a client network, then client connections should use the shared network.

Choosing Which Hadoop Interface to Use

Vertica provides several ways to interact with data stored in Hadoop. This section explains how to choose among them. Decisions about Cluster Layout can affect the decisions you make about Hadoop interfaces.

Creating an HDFS Storage Location

Using a storage location to store data in the Vertica native file format (ROS) delivers the best query performance among the available Hadoop options. (Storing ROS files on the local disk rather than in Hadoop is faster still.) If you already have data in Hadoop, however, doing this means you are importing that data into Vertica.

For co-located clusters, which do not use local file storage, you might still choose to use an HDFS storage location for better performance. You can use the HDFS Connector to load data that is already in HDFS into Vertica. For separate clusters, which use local file storage, consider using an HDFS storage location for lower-priority data. See Using HDFS Storage Locations and Using the HDFS Connector.

Using the ORC Reader

If your data is stored in the Optimized Row Columnar format, an open format supported by most Hadoop providers, Vertica can query that data directly from HDFS. This is faster than using the HCatalog Connector, but you cannot pull schema definitions from Hive directly into the database. The ORC Reader reads the data in place; no extra copies are made. See Using the ORC Reader.

Using the HCatalog Connector

The HCatalog Connector uses Hadoop services (Hive and HCatalog) to query data stored in HDFS. Like the ORC Reader, it reads data in place rather than making copies. Using this interface you can read all file formats supported by Hadoop, including Parquet and ORC, and Vertica can use Hive's schema definitions. However, performance can be poor in some cases. The HCatalog Connector is also sensitive to changes in the Hadoop libraries on which it depends; upgrading your Hadoop cluster might affect your HCatalog connections. See Using the HCatalog Connector, and the short example at the end of this section.

Using the HDFS Connector

The HDFS Connector can be used to create and query external tables, reading the data in place rather than making copies. The HDFS Connector can be used with any data format for which a parser is available. It does not use Hive data; you have to define the table yourself. Its performance can be poor because, like the HCatalog Connector, it cannot take advantage of the benefits of columnar file formats. See Using the HDFS Connector.

Using the MapReduce Connector

The other interfaces described in this section allow you to read Hadoop data from Vertica or create Vertica data in Hadoop. The MapReduce Connector, in contrast, allows you to integrate with Hadoop's MapReduce jobs. Use this connector to send Vertica data to MapReduce or to have MapReduce jobs create data in Vertica. See Using the MapReduce Connector.
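As a preview of the HCatalog Connector chapter, the following sketch maps a Vertica schema to a Hive schema and queries a Hive table through it. The host, port, user, and table names are hypothetical, and the full parameter list is covered in Defining a Schema Using the HCatalog Connector.

=> -- Map a local schema named hcat to the Hive schema "default"
=> -- (hostname, port, and user are placeholders for your WebHCat service)
=> CREATE HCATALOG SCHEMA hcat WITH HOSTNAME='webhcat.example.com' PORT=50111
       HCATALOG_SCHEMA='default' HCATALOG_USER='hcatuser';
=> -- Hive tables in that schema can then be queried like native tables
=> SELECT * FROM hcat.weblogs LIMIT 10;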

Using Kerberos with Hadoop

If your Hadoop cluster uses Kerberos authentication to restrict access to HDFS, you must configure Vertica to make authenticated connections. The details of this configuration vary, based on which methods you are using to access HDFS data:

- How Vertica uses Kerberos With Hadoop
- Configuring Kerberos

How Vertica uses Kerberos With Hadoop

Vertica authenticates with Hadoop in two ways that require different configurations:

- User authentication: on behalf of the user, by passing along the user's existing Kerberos credentials, as occurs with the HDFS Connector and the HCatalog Connector.
- Vertica authentication: on behalf of system processes (such as the Tuple Mover), by using a special Kerberos credential stored in a keytab file.

User Authentication

To use Vertica with Kerberos and Hadoop, the client user first authenticates with the Kerberos server (Key Distribution Center, or KDC) being used by the Hadoop cluster. A user might run kinit or sign in to Active Directory, for example.

A user who authenticates to a Kerberos server receives a Kerberos ticket. At the beginning of a client session, Vertica automatically retrieves this ticket. The database then uses this ticket to get a Hadoop token, which Hadoop uses to grant access. Vertica uses this token to access HDFS, such as when executing a query on behalf of the user. When the token expires, the database automatically renews it, also renewing the Kerberos ticket if necessary.

The following figure shows how the user, Vertica, Hadoop, and Kerberos interact in user authentication:

[Figure: user authentication flow among the user, Vertica, Kerberos, and Hadoop]

When using the HDFS Connector or the HCatalog Connector, or when reading an ORC file stored in HDFS, Vertica uses the client identity as the preceding figure shows.

Vertica Authentication

Automatic processes, such as the Tuple Mover, do not log in the way users do. Instead, Vertica uses a special identity (principal) stored in a keytab file on every database node. (This approach is also used for Vertica clusters that use Kerberos but do not use Hadoop.) After you configure the keytab file, Vertica uses the principal residing there to automatically obtain and maintain a Kerberos ticket, much as in the client scenario. In this case, the client does not interact with Kerberos.

The following figure shows the interactions required for Vertica authentication:

[Figure: Vertica authentication flow using the keytab principal on each database node]

Each Vertica node uses its own principal; it is common to incorporate the name of the node into the principal name. You can either create one keytab per node, containing only that node's principal, or you can create a single keytab containing all the principals and distribute the file to all nodes. Either way, the node uses its principal to get a Kerberos ticket and then uses that ticket to get a Hadoop token. For simplicity, the preceding figure shows the full set of interactions for only one database node.

When creating HDFS storage locations, Vertica uses the principal in the keytab file, not the principal of the user issuing the CREATE LOCATION statement.

See Also

For specific configuration instructions, see Configuring Kerberos.

Configuring Kerberos

Vertica can connect with Hadoop in several ways, and how you manage Kerberos authentication varies by connection type. This documentation assumes that you are using Kerberos for both your HDFS and Vertica clusters.

Prerequisite: Setting Up Users and the Keytab File

If you have not already configured Kerberos authentication for Vertica, follow the instructions in Configure for Kerberos Authentication. In particular:

- Create one Kerberos principal per node.
- Place the keytab file(s) in the same location on each database node and set that location in the KerberosKeytabFile configuration parameter (see Specify the Location of the Keytab File).
- Set KerberosServiceName to the name of the principal (see Inform Vertica About the Kerberos Principal).

A minimal example of setting these parameters follows the HDFS Connector subsection below.

HCatalog Connector

You use the HCatalog Connector to query data in Hive. Queries are executed on behalf of Vertica users. If the current user has a Kerberos key, then Vertica passes it to the HCatalog Connector automatically. Verify that all users who need access to Hive have been granted access to HDFS.

In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that you enable all Hadoop components to impersonate the Vertica user. The easiest way to do this is to set the proxyuser property using wildcards for all users on all hosts and in all groups. Consult your Hadoop documentation for instructions. Make sure you do this before running hcatutil (see Configuring Vertica for HCatalog).

HDFS Connector

The HDFS Connector loads data from HDFS into Vertica on behalf of the user, using a User Defined Source. If the user performing the data load has a Kerberos key, then the UDS uses it to access HDFS. Verify that all users who use this connector have been granted access to HDFS.
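The following sketch shows one way to set the prerequisite parameters listed above. The database name, keytab path, and service name are hypothetical; use the values from your own Kerberos configuration.

=> -- Hypothetical keytab location, identical on every database node
=> ALTER DATABASE mydb SET KerberosKeytabFile = '/etc/vertica/vertica.keytab';
=> -- Hypothetical service (principal) name
=> ALTER DATABASE mydb SET KerberosServiceName = 'vertica';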

HDFS Storage Location

You can create a database storage location in HDFS. An HDFS storage location provides improved performance compared to other HDFS interfaces (such as the HCatalog Connector).

After you create Kerberos principals for each node, give all of them read and write permissions to the HDFS directory you will use as a storage location. If you plan to back up HDFS storage locations, take the following additional steps:

- Grant Hadoop superuser privileges to the new principals.
- Configure backups, including setting the HadoopConfigDir configuration parameter, following the instructions in Configuring Hadoop and Vertica to Enable Backup of HDFS Storage.
- Configure user impersonation to be able to restore from backups, following the instructions in "Setting Kerberos Parameters" in Configuring Vertica to Restore HDFS Storage Locations.

Because the keytab file supplies the principal used to create the location, you must have it in place before creating the storage location. After you deploy keytab files to all database nodes, use the CREATE LOCATION statement to create the storage location as usual; a sketch follows this section.

Token Expiration

Vertica attempts to automatically refresh Hadoop tokens before they expire, but you can also set a minimum refresh frequency if you prefer. The HadoopFSTokenRefreshFrequency configuration parameter specifies the frequency in seconds:

=> ALTER DATABASE exampledb SET HadoopFSTokenRefreshFrequency = '86400';

If the current age of the token is greater than the value specified in this parameter, Vertica refreshes the token before accessing data stored in HDFS.

See Also

How Vertica uses Kerberos With Hadoop
Troubleshooting Kerberos Authentication
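The following sketch outlines the CREATE LOCATION step mentioned under HDFS Storage Location above. The URL, port, and label are hypothetical, and the exact URL form and options for HDFS storage locations are given in Creating an HDFS Storage Location; treat this as an outline rather than authoritative syntax.

=> -- Hypothetical HDFS directory to which the node principals have read/write access
=> CREATE LOCATION 'webhdfs://namenode.example.com:50070/user/dbadmin/verticadata'
       ALL NODES USAGE 'data' LABEL 'hdfs_storage';

After creating the location, you typically assign objects to it with a storage policy, as described in Creating a Storage Policy for HDFS Storage Locations.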

Using the ORC Reader

If your HDFS data is in the Optimized Row Columnar (ORC) format and uses no complex data types, then instead of using the HCatalog Connector you can use the ORC Reader to access the data directly. Reading directly may provide better performance.

The decisions you make when writing ORC files can affect performance when using them. To get the best performance from the ORC Reader, do the following when writing:

- Use the latest available Hive version to write ORC files. (You can still read them with earlier versions.)
- Use a large stripe size; 256 MB or greater is preferred.
- Partition the data at the table level.
- Sort the columns based on frequency of access, most-frequent first.
- Use Snappy or ZLib compression.

Syntax

In the COPY statement, specify a format of ORC as follows:

COPY tablename FROM path ORC;

In the CREATE EXTERNAL TABLE AS COPY statement, specify a format of ORC as follows:

CREATE EXTERNAL TABLE tablename (columns) AS COPY FROM path ORC;

If the file resides on the local file system of the node where you are issuing the command, use a local file path for path. If the file resides elsewhere in HDFS, use the webhdfs:// prefix and then specify the host name, port, and file path. Use ON ANY NODE for files that are not local to improve performance.

COPY t FROM 'webhdfs://somehost:port/opt/data/orcfile' ON ANY NODE ORC;

The ORC Reader supports ZLib and Snappy compression. It does not support GZIP, BZIP, or LZO compression.

The CREATE EXTERNAL TABLE AS COPY statement must consume all of the columns in the ORC file; unlike with some other data sources, you cannot select only the columns of interest. If you omit columns, the ORC Reader aborts with an error and does not copy any data.

If you load from multiple ORC files in the same COPY statement and any of them is aborted, the entire load is aborted. This is different behavior than for delimited files, where the COPY statement loads what it can and ignores the rest.

Supported Data Types

The Vertica ORC file reader can natively read columns of all data types supported in Hive version 0.11 and later except for complex types. If complex types such as maps are encountered, the COPY or CREATE EXTERNAL TABLE AS COPY statement aborts with an error message. The ORC Reader does not attempt to read only some columns; either the entire file is read or the operation fails. For a complete list of supported types, see HIVE Data Types.

Timestamps

Reading timestamps from an ORC file in Vertica might result in different values than reading the same ORC file with another tool such as Hive. ORC files store timestamps without any time zone information. Vertica interprets timestamps without time zones as values in the local time zone, which might not be the time zone from which the ORC file was written. If you know which time zone was used as the reference to write the ORC file, you can set the default time zone in Vertica to correct the problem.

For example, suppose a set of timestamp values was written to an ORC file in the America/New_York time zone. When you read those values into Vertica, the values you see can change, depending on your local time zone:

=> CREATE EXTERNAL TABLE t (ts timestamp) AS COPY FROM '/path/to/file.orc' orc;
CREATE TABLE
=> SELECT ts FROM t;

To adjust ORC time zones, set the TIMEZONE variable in Vertica, and the values are then reported relative to that zone:

=> SET TIMEZONE 'America/New_York';
SET
=> SELECT ts AT TIMEZONE 'GMT' AS tstz FROM t;

The TIMEZONE variable is global to your Vertica cluster and affects all timestamp values that do not include a time zone. If you have ORC files written in more than one time zone, you cannot adjust all of them using this approach. For more information about the TIMEZONE variable, see Using Time Zones With Vertica.

When Hive writes ORC files it converts dates before 1583 to the Julian calendar. Vertica does not perform this conversion. If your ORC file contains dates before this time, values in Hive and the corresponding values in Vertica will differ by up to ten days. This applies to both DATE and TIMESTAMP values.

Kerberos

If the ORC file is located on an HDFS cluster that uses Kerberos authentication, Vertica uses the current user's principal to authenticate. It does not use the database's principal.

Query Performance

When working with external tables in ORC format, Vertica tries to improve performance in two ways: by pushing query execution closer to the data so less has to be read and transmitted, and by taking advantage of data locality in planning the query.

Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network. ORC files have three levels of indexing: file statistics, stripe statistics, and row group indexes. Predicates are applied only to the first two levels.

Predicate pushdown works and is automatically applied for ORC files written with Hive version 0.14 and later. ORC files written with earlier versions of Hive might not contain the required statistics. When executing a query against an ORC file that lacks these statistics, Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS system table. If you are seeing performance problems with your queries, check this table for these events (an example query follows the Examples section below).

In a cluster where Vertica nodes are co-located on HDFS nodes, the query can also take advantage of data locality. If data is on an HDFS node where a database node is also present, and if the query is not restricted to specific nodes using ON NODE, then the query planner uses that database node to read that data. This allows Vertica to read data locally instead of making a network call.

You can see how much ORC data is being read locally by inspecting the query plan. The label for LoadStep(s) in the plan contains a statement of the form: "X% of ORC data matched with co-located Vertica nodes". To increase the volume of local reads, consider adding more database nodes. HDFS data, by its nature, can't be moved to specific nodes, but if you run more database nodes you increase the likelihood that a database node is local to one of the copies of the data.

Examples

The following example shows how to read from all ORC files in a directory. It uses all supported data types.

CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT,
    a5 FLOAT, a6 DOUBLE PRECISION, a7 BOOLEAN, a8 DATE, a9 TIMESTAMP,
    a10 VARCHAR(20), a11 VARCHAR(20), a12 CHAR(20), a13 BINARY(20),
    a14 DECIMAL(10,5))
AS COPY FROM '/data/orc_test_*.orc' ORC;

The following example shows the error that is produced if the file you specify is not recognized as an ORC file:

CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT)
AS COPY FROM '/data/not_an_orc_file.orc' ORC;
ERROR 0: Failed to read orc source [/data/not_an_orc_file.orc]: Not an ORC file
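To follow up on the Query Performance discussion above, the following sketch checks whether recent queries skipped ORC predicate pushdown. It assumes only the QUERY_EVENTS system table and the event type named in that section.

=> -- Find queries for which ORC predicate pushdown could not be applied
=> SELECT transaction_id, statement_id, event_description
       FROM query_events
       WHERE event_type = 'EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED';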

Using the HCatalog Connector

The Vertica HCatalog Connector lets you access data stored in Apache's Hive data warehouse software the same way you access it within a native Vertica table. If your files are in the Optimized Row Columnar (ORC) format, you might be able to read them directly instead of going through this connector. For more information, see Using the ORC Reader.

Hive, HCatalog, and WebHCat Overview

There are several Hadoop components that you need to understand in order to use the HCatalog Connector:

- Apache's Hive lets you query data stored in a Hadoop Distributed File System (HDFS) the same way you query data stored in a relational database. Behind the scenes, Hive uses a set of serializer and deserializer (SerDe) classes to extract data from files stored on the HDFS and break it into columns and rows. Each SerDe handles data files in a specific format. For example, one SerDe extracts data from comma-separated data files while another interprets data stored in JSON format.
- Apache HCatalog is a component of the Hadoop ecosystem that makes Hive's metadata available to other Hadoop components (such as Pig).
- WebHCat (formerly known as Templeton) makes HCatalog and Hive data available via a REST web API. Through it, you can make an HTTP request to retrieve data stored in Hive, as well as information about the Hive schema.

Vertica's HCatalog Connector lets you transparently access data that is available through WebHCat. You use the connector to define a schema in Vertica that corresponds to a Hive database or schema. When you query data within this schema, the HCatalog Connector transparently extracts and formats the data from Hadoop into tabular data. The data within this HCatalog schema appears as if it is native to Vertica. You can even perform operations such as joins between Vertica-native tables and HCatalog tables. For more details, see How the HCatalog Connector Works.

HCatalog Connection Features

The HCatalog Connector lets you query data stored in Hive using the Vertica native SQL syntax. Some of its main features are:

- The HCatalog Connector always reflects the current state of data stored in Hive.
- The HCatalog Connector uses the parallel nature of both Vertica and Hadoop to process Hive data. The result is that querying data through the HCatalog Connector is often faster than querying the data directly through Hive.
- Since Vertica performs the extraction and parsing of data, the HCatalog Connector does not significantly increase the load on your Hadoop cluster.
- The data you query through the HCatalog Connector can be used as if it were native Vertica data. For example, you can execute a query that joins data from a table in an HCatalog schema with a native table.

HCatalog Connection Considerations

There are a few things to keep in mind when using the HCatalog Connector:

- Hive's data is stored in flat files in a distributed filesystem, requiring it to be read and deserialized each time it is queried. This deserialization causes Hive's performance to be much slower than Vertica's. The HCatalog Connector has to perform the same process as Hive to read the data. Therefore, querying data stored in Hive using the HCatalog Connector is much slower than querying a native Vertica table. If you need to perform extensive analysis on data stored in Hive, you should consider loading it into Vertica through the HCatalog Connector or the WebHDFS connector. Vertica optimization often makes querying data through the HCatalog Connector faster than directly querying it through Hive.
- Hive supports complex data types such as lists, maps, and structs that Vertica does not support. Columns containing these data types are converted to a JSON representation of the data type and stored as a VARCHAR. See Data Type Conversions from Hive to Vertica.

Note: The HCatalog Connector is read only. It cannot insert data into Hive.

How the HCatalog Connector Works

When planning a query that accesses data from a Hive table, the Vertica HCatalog Connector on the initiator node contacts the WebHCat server in your Hadoop cluster to determine if the table exists. If it does, the connector retrieves the table's metadata from the metastore database so the query planning can continue. When the query executes, all nodes in the Vertica cluster directly retrieve the data necessary for completing the query from the Hadoop HDFS. They then use the Hive SerDe classes to extract the data so the query can execute.

This approach takes advantage of the parallel nature of both Vertica and Hadoop. In addition, by performing the retrieval and extraction of data directly, the HCatalog Connector reduces the impact of the query on the Hadoop cluster.

HCatalog Connector Requirements

Before you can use the HCatalog Connector, both your Vertica and Hadoop installations must meet the following requirements.

Vertica Requirements

All of the nodes in your cluster must have a Java Virtual Machine (JVM) installed. See Installing the Java Runtime on Your Vertica Cluster.

You must also add certain libraries distributed with Hadoop and Hive to your Vertica installation directory. See Configuring Vertica for HCatalog.

Hadoop Requirements

Your Hadoop cluster must meet several requirements to operate correctly with the Vertica Connector for HCatalog:

- It must have Hive and HCatalog installed and running. See Apache's HCatalog page for more information.
- It must have WebHCat (formerly known as Templeton) installed and running. See Apache's WebHCat page for details.
- The WebHCat server and all of the HDFS nodes that store HCatalog data must be directly accessible from all of the hosts in your Vertica database. Verify that any firewall separating the Hadoop cluster and the Vertica cluster will pass WebHCat, metastore database, and HDFS traffic.
- The data that you want to query must be in an internal or external Hive table.
- If a table you want to query uses a non-standard SerDe, you must install the SerDe's classes on your Vertica cluster before you can query the data. See Using Non-Standard SerDes.

Testing Connectivity

To test the connection between your database cluster and WebHCat, log into a node in your Vertica cluster. Then, run the following command to execute an HCatalog query:

$ curl http://webhcatserver:port/templeton/v1/status?user.name=hcatusername

Where:

- webhcatserver is the IP address or hostname of the WebHCat server
- port is the port number assigned to the WebHCat service (usually 50111)
- hcatusername is a valid username authorized to use HCatalog

Usually, you want to append ;echo to the command to add a linefeed after the curl command's output. Otherwise, the command prompt is automatically appended to the command's output, making it harder to read. For example:

$ curl http://webhcatserver:port/templeton/v1/status?user.name=hcatusername; echo

If there are no errors, this command returns a status message in JSON format, similar to the following:

{"status":"ok","version":"v1"}

This result indicates that WebHCat is running and that the Vertica host can connect to it and retrieve a result. If you do not receive this result, troubleshoot your Hadoop installation and the connectivity between your Hadoop and Vertica clusters. For details, see Troubleshooting HCatalog Connector Problems.

You can also run some queries to verify that WebHCat is correctly configured to work with Hive. The following example demonstrates listing the databases defined in Hive and the tables defined within a database:

$ curl http://webhcatserver:port/templeton/v1/ddl/database?user.name=hcatusername; echo
{"databases":["default","production"]}
$ curl http://webhcatserver:port/templeton/v1/ddl/database/default/table?user.name=hcatusername; echo
{"tables":["messages","weblogs","tweets","transactions"],"database":"default"}

See Apache's WebHCat reference for details about querying Hive using WebHCat.

Installing the Java Runtime on Your Vertica Cluster

The HCatalog Connector requires a 64-bit Java Virtual Machine (JVM). The JVM must support Java 6 or later, and must be the same version as the one installed on your Hadoop nodes.

Note: If your Vertica cluster is configured to execute User Defined Extensions (UDxs) written in Java, it already has a correctly-configured JVM installed. See Developing User Defined Functions in Java in Extending Vertica for more information.

Installing Java on your Vertica cluster is a two-step process:

1. Install a Java runtime on all of the hosts in your cluster.
2. Set the JavaBinaryForUDx configuration parameter to tell Vertica the location of the Java executable.

Installing a Java Runtime

For Java-based features, Vertica requires a 64-bit Java 6 (Java version 1.6) or later Java runtime. Vertica supports runtimes from either Oracle or OpenJDK. You can choose to install either the Java Runtime Environment (JRE) or Java Development Kit (JDK), since the JDK also includes the JRE.

Many Linux distributions include a package for the OpenJDK runtime. See your Linux distribution's documentation for information about installing and configuring OpenJDK.

To install the Oracle Java runtime, see the Java Standard Edition (SE) Download Page. You usually run the installation package as root in order to install it. See the download page for instructions.

Once you have installed a JVM on each host, ensure that the java command is in the search path and calls the correct JVM by running the command:

$ java -version

This command should print something similar to:

java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build b01, mixed mode)

Note: Any previously installed Java VM on your hosts may interfere with a newly installed Java runtime. See your Linux distribution's documentation for instructions on configuring which JVM is the default. Unless absolutely required, you should uninstall any incompatible version of Java before installing the Java 6 or Java 7 runtime.

Setting the JavaBinaryForUDx Configuration Parameter

The JavaBinaryForUDx configuration parameter tells Vertica where to look for the JRE to execute Java UDxs. After you have installed the JRE on all of the nodes in your cluster, set this parameter to the absolute path of the Java executable. You can use the symbolic link that some Java installers create (for example /usr/bin/java). If the Java executable is in your shell search path, you can get the path of the Java executable by running the following command from the Linux command line shell:

$ which java
/usr/bin/java

If the java command is not in the shell search path, use the path to the Java executable in the directory where you installed the JRE. Suppose you installed the JRE in /usr/java/default (which is where the installation package supplied by Oracle installs the Java 1.6 JRE). In this case the Java executable is /usr/java/default/bin/java.

You set the configuration parameter by executing the following statement as a database superuser:

=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';

See ALTER DATABASE for more information on setting configuration parameters.

To view the current setting of the configuration parameter, query the CONFIGURATION_PARAMETERS system table:

=> \x
Expanded display is on.
=> SELECT * FROM CONFIGURATION_PARAMETERS WHERE parameter_name = 'JavaBinaryForUDx';
-[ RECORD 1 ]-----------------+-----------------------------------------------------------
node_name                     | ALL
parameter_name                | JavaBinaryForUDx
current_value                 | /usr/bin/java
default_value                 |
change_under_support_guidance | f
change_requires_restart       | f
description                   | Path to the java binary for executing UDx written in Java

Once you have set the configuration parameter, Vertica can find the Java executable on each node in your cluster.

Note: Since the location of the Java executable is set by a single configuration parameter for the entire cluster, you must ensure that the Java executable is installed in the same path on all of the hosts in the cluster.

Configuring Vertica for HCatalog

Before you can use the HCatalog Connector, you must add certain Hadoop and Hive libraries to your Vertica installation. You must also copy the Hadoop configuration files that specify various connection properties. Vertica uses the values in those configuration files to make its own connections to Hadoop. You need only make these changes on one node in your cluster. After you do this you can install the HCatalog Connector.

Copy Hadoop Libraries and Configuration Files

Vertica provides a tool, hcatutil, to collect the required files from Hadoop. This tool copies selected libraries and XML configuration files from your Hadoop cluster to your Vertica cluster. This tool might also need access to additional libraries:

- If you plan to use Hive to query files that use Snappy compression, you need access to the Snappy native libraries, libhadoop*.so and libsnappy*.so.
- If you plan to use a JSON SerDe with a Hive table, you need access to its library. This is the same library that you used to configure Hive; for example:

  hive> add jar /home/release/json-serde-1.3-jar-with-dependencies.jar;
  hive> create external table nationjson (id int,name string,rank int,text string)
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
        LOCATION '/user/release/vt/nationjson';

If either of these cases applies to you, do one of the following:

- Include the path(s) in the path you specify as the value of --hcatlibpath, or
- Copy the file(s) to a directory already on that path.

If Vertica is not co-located on a Hadoop node, you should do the following:

1. Copy /opt/vertica/packages/hcat/tools/hcatutil to a Hadoop node and run it there, specifying a temporary output directory. Your Hadoop, Hive, and HCatalog lib paths might be different; in particular, in newer versions of Hadoop the HCatalog directory is usually a subdirectory under the Hive directory. Use the values from your environment in the following command:

   hcatutil --copyjars
       --hadoophivehome="/hadoop/lib;/hive/lib;/hcatalog/dist/share"
       --hadoophiveconfpath="/hadoop;/hive;/webhcat"
       --hcatlibpath=/tmp/hadoop-files

2. Verify that all necessary files were copied:

   hcatutil --verifyjars --hcatlibpath=/tmp/hadoop-files

3. Copy that output directory (/tmp/hadoop-files, in this example) to /opt/vertica/packages/hcat/lib on the Vertica node you will connect to when installing the HCatalog Connector. If you are updating a Vertica cluster to use a new Hadoop cluster (or a new version of Hadoop), first remove all JAR files in /opt/vertica/packages/hcat/lib except vertica-hcatalogudl.jar.

4. Verify that all necessary files were copied:

   hcatutil --verifyjars --hcatlibpath=/opt/vertica/packages/hcat

If you are using the Vertica for SQL on Hadoop product with co-located clusters, you can do this in one step on a shared node. Your Hadoop, Hive, and HCatalog lib paths might be different; use the values from your environment in the following command:

hcatutil --copyjars
    --hadoophivehome="/hadoop/lib;/hive/lib;/hcatalog/dist/share"
    --hadoophiveconfpath="/hadoop;/hive;/webhcat"
    --hcatlibpath=/opt/vertica/packages/hcat/lib

The hcatutil script has the following arguments:


More information

Connectivity Pack for Microsoft Guide

Connectivity Pack for Microsoft Guide HP Vertica Analytic Database Software Version: 7.0.x Document Release Date: 2/20/2015 Legal Notices Warranty The only warranties for HP products and services are set forth in the express warranty statements

More information

How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1

How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1 How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Simba XMLA Provider for Oracle OLAP 2.0. Linux Administration Guide. Simba Technologies Inc. April 23, 2013

Simba XMLA Provider for Oracle OLAP 2.0. Linux Administration Guide. Simba Technologies Inc. April 23, 2013 Simba XMLA Provider for Oracle OLAP 2.0 April 23, 2013 Simba Technologies Inc. Copyright 2013 Simba Technologies Inc. All Rights Reserved. Information in this document is subject to change without notice.

More information

Move Data from Oracle to Hadoop and Gain New Business Insights

Move Data from Oracle to Hadoop and Gain New Business Insights Move Data from Oracle to Hadoop and Gain New Business Insights Written by Lenka Vanek, senior director of engineering, Dell Software Abstract Today, the majority of data for transaction processing resides

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

Kognitio Technote Kognitio v8.x Hadoop Connector Setup

Kognitio Technote Kognitio v8.x Hadoop Connector Setup Kognitio Technote Kognitio v8.x Hadoop Connector Setup For External Release Kognitio Document No Authors Reviewed By Authorised By Document Version Stuart Watt Date Table Of Contents Document Control...

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

New Features... 1 Installation... 3 Upgrade Changes... 3 Fixed Limitations... 4 Known Limitations... 5 Informatica Global Customer Support...

New Features... 1 Installation... 3 Upgrade Changes... 3 Fixed Limitations... 4 Known Limitations... 5 Informatica Global Customer Support... Informatica Corporation B2B Data Exchange Version 9.5.0 Release Notes June 2012 Copyright (c) 2006-2012 Informatica Corporation. All rights reserved. Contents New Features... 1 Installation... 3 Upgrade

More information

EMC Documentum Connector for Microsoft SharePoint

EMC Documentum Connector for Microsoft SharePoint EMC Documentum Connector for Microsoft SharePoint Version 7.1 Installation Guide EMC Corporation Corporate Headquarters Hopkinton, MA 01748-9103 1-508-435-1000 www.emc.com Legal Notice Copyright 2013-2014

More information

Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes

Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes Contents Pentaho Corporation Version 5.1 Copyright Page New Features in Pentaho Data Integration 5.1 PDI Version 5.1 Minor Functionality Changes Legal Notices https://help.pentaho.com/template:pentaho/controls/pdftocfooter

More information

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A

More information

Hadoop Integration Guide

Hadoop Integration Guide HP Vertica Analytic Database Software Version: 7.0.x Document Release Date: 2/20/2015 Legal Notices Warranty The only warranties for HP products and services are set forth in the express warranty statements

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

Symantec Enterprise Solution for Hadoop Installation and Administrator's Guide 1.0

Symantec Enterprise Solution for Hadoop Installation and Administrator's Guide 1.0 Symantec Enterprise Solution for Hadoop Installation and Administrator's Guide 1.0 The software described in this book is furnished under a license agreement and may be used only in accordance with the

More information

docs.hortonworks.com

docs.hortonworks.com docs.hortonworks.com Hortonworks Data Platform: Configuring Kafka for Kerberos Over Ambari Copyright 2012-2015 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop,

More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

and Hadoop Technology

and Hadoop Technology SAS and Hadoop Technology Overview SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS and Hadoop Technology: Overview. Cary, NC: SAS Institute

More information

HP Vertica Integration with SAP Business Objects: Tips and Techniques. HP Vertica Analytic Database

HP Vertica Integration with SAP Business Objects: Tips and Techniques. HP Vertica Analytic Database HP Vertica Integration with SAP Business Objects: Tips and Techniques HP Vertica Analytic Database HP Big Data Document Release Date: June 23, 2015 Legal Notices Warranty The only warranties for HP products

More information

Cloudera Backup and Disaster Recovery

Cloudera Backup and Disaster Recovery Cloudera Backup and Disaster Recovery Important Notice (c) 2010-2013 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans

More information

HP Operations Orchestration Software

HP Operations Orchestration Software HP Operations Orchestration Software Software Version: 9.00 HP Service Desk Integration Guide Document Release Date: June 2010 Software Release Date: June 2010 Legal Notices Warranty The only warranties

More information

Metalogix SharePoint Backup. Advanced Installation Guide. Publication Date: August 24, 2015

Metalogix SharePoint Backup. Advanced Installation Guide. Publication Date: August 24, 2015 Metalogix SharePoint Backup Publication Date: August 24, 2015 All Rights Reserved. This software is protected by copyright law and international treaties. Unauthorized reproduction or distribution of this

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

HP reference configuration for entry-level SAS Grid Manager solutions

HP reference configuration for entry-level SAS Grid Manager solutions HP reference configuration for entry-level SAS Grid Manager solutions Up to 864 simultaneous SAS jobs and more than 3 GB/s I/O throughput Technical white paper Table of contents Executive summary... 2

More information

HP Project and Portfolio Management Center

HP Project and Portfolio Management Center HP Project and Portfolio Management Center Software Version: 9.20 RESTful Web Services Guide Document Release Date: February 2013 Software Release Date: February 2013 Legal Notices Warranty The only warranties

More information

Control-M for Hadoop. Technical Bulletin. www.bmc.com

Control-M for Hadoop. Technical Bulletin. www.bmc.com Technical Bulletin Control-M for Hadoop Version 8.0.00 September 30, 2014 Tracking number: PACBD.8.0.00.004 BMC Software is announcing that Control-M for Hadoop now supports the following: Secured Hadoop

More information

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS) Use Data from a Hadoop Cluster with Oracle Database Hands-On Lab Lab Structure Acronyms: OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS) All files are

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

More information

Interworks. Interworks Cloud Platform Installation Guide

Interworks. Interworks Cloud Platform Installation Guide Interworks Interworks Cloud Platform Installation Guide Published: March, 2014 This document contains information proprietary to Interworks and its receipt or possession does not convey any rights to reproduce,

More information

Innovative technology for big data analytics

Innovative technology for big data analytics Technical white paper Innovative technology for big data analytics The HP Vertica Analytics Platform database provides price/performance, scalability, availability, and ease of administration Table of

More information

StreamServe Persuasion SP4

StreamServe Persuasion SP4 StreamServe Persuasion SP4 Installation Guide Rev B StreamServe Persuasion SP4 Installation Guide Rev B 2001-2009 STREAMSERVE, INC. ALL RIGHTS RESERVED United States patent #7,127,520 No part of this document

More information

Upgrading VMware Identity Manager Connector

Upgrading VMware Identity Manager Connector Upgrading VMware Identity Manager Connector VMware Identity Manager This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new

More information

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Version 3.0 Please note: This appliance is for testing and educational purposes only; it is unsupported and not

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

IBM Campaign and IBM Silverpop Engage Version 1 Release 2 August 31, 2015. Integration Guide IBM

IBM Campaign and IBM Silverpop Engage Version 1 Release 2 August 31, 2015. Integration Guide IBM IBM Campaign and IBM Silverpop Engage Version 1 Release 2 August 31, 2015 Integration Guide IBM Note Before using this information and the product it supports, read the information in Notices on page 93.

More information

How To Install An Aneka Cloud On A Windows 7 Computer (For Free)

How To Install An Aneka Cloud On A Windows 7 Computer (For Free) MANJRASOFT PTY LTD Aneka 3.0 Manjrasoft 5/13/2013 This document describes in detail the steps involved in installing and configuring an Aneka Cloud. It covers the prerequisites for the installation, the

More information

docs.hortonworks.com

docs.hortonworks.com docs.hortonworks.com : Security Administration Tools Guide Copyright 2012-2014 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform

More information

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture. Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in

More information

ADAM 5.5. System Requirements

ADAM 5.5. System Requirements ADAM 5.5 System Requirements 1 1. Overview The schema below shows an overview of the ADAM components that will be installed and set up. ADAM Server: hosts the ADAM core components. You must install the

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big

More information

Revolution R Enterprise 7 Hadoop Configuration Guide

Revolution R Enterprise 7 Hadoop Configuration Guide Revolution R Enterprise 7 Hadoop Configuration Guide The correct bibliographic citation for this manual is as follows: Revolution Analytics, Inc. 2014. Revolution R Enterprise 7 Hadoop Configuration Guide.

More information

Installation Guide. SAP Control Center 3.3

Installation Guide. SAP Control Center 3.3 Installation Guide SAP Control Center 3.3 DOCUMENT ID: DC01002-01-0330-01 LAST REVISED: November 2013 Copyright 2013 by SAP AG or an SAP affiliate company. All rights reserved. No part of this publication

More information

IBM Campaign Version-independent Integration with IBM Engage Version 1 Release 3 April 8, 2016. Integration Guide IBM

IBM Campaign Version-independent Integration with IBM Engage Version 1 Release 3 April 8, 2016. Integration Guide IBM IBM Campaign Version-independent Integration with IBM Engage Version 1 Release 3 April 8, 2016 Integration Guide IBM Note Before using this information and the product it supports, read the information

More information

Integrating with Apache Kafka HPE Vertica Analytic Database. Software Version: 7.2.x

Integrating with Apache Kafka HPE Vertica Analytic Database. Software Version: 7.2.x HPE Vertica Analytic Database Software Version: 7.2.x Document Release Date: 2/4/2016 Legal Notices Warranty The only warranties for Hewlett Packard Enterprise products and services are set forth in the

More information

PHD Virtual Backup for Hyper-V

PHD Virtual Backup for Hyper-V PHD Virtual Backup for Hyper-V version 7.0 Installation & Getting Started Guide Document Release Date: December 18, 2013 www.phdvirtual.com PHDVB v7 for Hyper-V Legal Notices PHD Virtual Backup for Hyper-V

More information

PATROL Console Server and RTserver Getting Started

PATROL Console Server and RTserver Getting Started PATROL Console Server and RTserver Getting Started Supporting PATROL Console Server 7.5.00 RTserver 6.6.00 February 14, 2005 Contacting BMC Software You can access the BMC Software website at http://www.bmc.com.

More information

HADOOP MOCK TEST HADOOP MOCK TEST I

HADOOP MOCK TEST HADOOP MOCK TEST I http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Parquet. Columnar storage for the people

Parquet. Columnar storage for the people Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li nong@cloudera.com Software engineer, Cloudera Impala Outline Context from various

More information

Installing and Configuring vcenter Multi-Hypervisor Manager

Installing and Configuring vcenter Multi-Hypervisor Manager Installing and Configuring vcenter Multi-Hypervisor Manager vcenter Server 5.1 vcenter Multi-Hypervisor Manager 1.1 This document supports the version of each product listed and supports all subsequent

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

HP LeftHand SAN Solutions

HP LeftHand SAN Solutions HP LeftHand SAN Solutions Support Document Installation Manuals Installation and Setup Guide Health Check Legal Notices Warranty The only warranties for HP products and services are set forth in the express

More information

Document Type: Best Practice

Document Type: Best Practice Global Architecture and Technology Enablement Practice Hadoop with Kerberos Deployment Considerations Document Type: Best Practice Note: The content of this paper refers exclusively to the second maintenance

More information

Quick Install Guide. Lumension Endpoint Management and Security Suite 7.1

Quick Install Guide. Lumension Endpoint Management and Security Suite 7.1 Quick Install Guide Lumension Endpoint Management and Security Suite 7.1 Lumension Endpoint Management and Security Suite - 2 - Notices Version Information Lumension Endpoint Management and Security Suite

More information

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering MySQL and Hadoop: Big Data Integration Shubhangi Garg & Neha Kumari MySQL Engineering 1Copyright 2013, Oracle and/or its affiliates. All rights reserved. Agenda Design rationale Implementation Installation

More information

H2O on Hadoop. September 30, 2014. www.0xdata.com

H2O on Hadoop. September 30, 2014. www.0xdata.com H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms

More information

http://docs.trendmicro.com

http://docs.trendmicro.com Trend Micro Incorporated reserves the right to make changes to this document and to the products described herein without notice. Before installing and using the product, please review the readme files,

More information

Installation and Configuration Guide for Windows and Linux

Installation and Configuration Guide for Windows and Linux Installation and Configuration Guide for Windows and Linux vcenter Operations Manager 5.7 This document supports the version of each product listed and supports all subsequent versions until the document

More information

Veeam Cloud Connect. Version 8.0. Administrator Guide

Veeam Cloud Connect. Version 8.0. Administrator Guide Veeam Cloud Connect Version 8.0 Administrator Guide April, 2015 2015 Veeam Software. All rights reserved. All trademarks are the property of their respective owners. No part of this publication may be

More information

TIBCO ActiveMatrix BusinessWorks Plug-in for TIBCO Managed File Transfer Software Installation

TIBCO ActiveMatrix BusinessWorks Plug-in for TIBCO Managed File Transfer Software Installation TIBCO ActiveMatrix BusinessWorks Plug-in for TIBCO Managed File Transfer Software Installation Software Release 6.0 November 2015 Two-Second Advantage 2 Important Information SOME TIBCO SOFTWARE EMBEDS

More information

Automated Process Center Installation and Configuration Guide for UNIX

Automated Process Center Installation and Configuration Guide for UNIX Automated Process Center Installation and Configuration Guide for UNIX Table of Contents Introduction... 1 Lombardi product components... 1 Lombardi architecture... 1 Lombardi installation options... 4

More information

HYPERION SYSTEM 9 N-TIER INSTALLATION GUIDE MASTER DATA MANAGEMENT RELEASE 9.2

HYPERION SYSTEM 9 N-TIER INSTALLATION GUIDE MASTER DATA MANAGEMENT RELEASE 9.2 HYPERION SYSTEM 9 MASTER DATA MANAGEMENT RELEASE 9.2 N-TIER INSTALLATION GUIDE P/N: DM90192000 Copyright 2005-2006 Hyperion Solutions Corporation. All rights reserved. Hyperion, the Hyperion logo, and

More information

Basic Installation of the Cisco Collection Manager

Basic Installation of the Cisco Collection Manager CHAPTER 3 Basic Installation of the Cisco Collection Manager Introduction This chapter gives the information required for a basic installation of the Cisco Collection Manager and the bundled Sybase database.

More information

HP D2D NAS Integration with HP Data Protector 6.11

HP D2D NAS Integration with HP Data Protector 6.11 HP D2D NAS Integration with HP Data Protector 6.11 Abstract This guide provides step by step instructions on how to configure and optimize HP Data Protector 6.11 in order to back up to HP D2D Backup Systems

More information