How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1


Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

Enable Big Data Edition to run mappings on a Hadoop cluster on MapR 4.0.1 or MapR 4.0.2.

Supported Versions

Informatica Big Data Edition 9.6.1 HotFix 2

Table of Contents

Overview
Step 1. Download EBF15328
Step 2. Update the Informatica Domain
  Applying EBF15328 to the Informatica Domain
  Configuring MapR Distribution Variables for Mappings in a Hive Environment
  Configuring hive-site.xml
  Configuring the hadoopenv.properties File for MapR 4.0.1
Step 3. Update the Hadoop Cluster
  Applying EBF15328 to the Hadoop Cluster
  Configuring a Hive Metastore in hive-site.xml
  Configuring the Heap Space for the MapR-FS
  Configuring the hadoopres.properties File for MapR 4.0.1
  Verifying the Cluster Details
Step 4. Update the Developer Tool
  Applying EBF15328 to the Informatica Client
  Configuring the Developer Tool
Step 5. Update PowerCenter
  Updating the Repository Plugin
  Configuring the PowerCenter Integration Service
  Copying MapR Distribution Files for PowerCenter Mappings in the Native Environment
Enable User Impersonation for Native and Hive Execution Environments
Enable Hive Pushdown for HBase
Connections Overview
HDFS Connection Properties
HBase Connection Properties
Hive Connection Properties
Creating a Connection
Known Limitations

Overview

EBF15328 adds support for MapR 4.0.1 and 4.0.2 with MapReduce 1 to Informatica 9.6.1 HotFix 2.

Note: Teradata Connector for Hadoop (Command Line Edition) does not support MapR 4.0.1 or 4.0.2. Only MapR 3.1 is supported.

To apply the EBF and configure Informatica, perform the following tasks:
1. Download the EBF.
2. Update the Informatica domain.
   Note: If the Data Integration Service runs on a machine that uses SUSE 11, the native mode of execution and Hive pushdown are not supported. Use a Data Integration Service that runs on a machine that uses RHEL.
3. Update the Hadoop cluster.
4. Update the Developer tool client.
5. Update PowerCenter.
Optionally, you can enable support for user impersonation and HBase.

Step 1. Download EBF15328

Before you enable MapR 4.0.1 or 4.0.2 for Informatica 9.6.1 HotFix 2, download the EBF.
1. Open a browser.
2. In the address field, enter the URL of the Informatica download site.
3. Navigate to the following directory: /updates/Informatica9/9.6.1 HotFix2/EBF15328
4. Download the following files:
   - EBF15328.Linux64-X86.tar.gz. Contains the EBF installer for the Informatica domain and the Hadoop cluster.
   - EBF15328_Client_Installer_win32_x86.zip. Contains the EBF installer for the Informatica client. Use this file to update the Developer tool.
5. Extract the files from EBF15328.Linux64-X86.tar.gz. The EBF15328.Linux64-X86.tar.gz file contains the following .tar files:
   - EBF15328_Server_installer_linux_em64t.tar. EBF installer for the Informatica domain. Use this file to update the Informatica domain.
   - EBF15328_HadoopRPM_EBFInstaller.tar. EBF installer for the Hadoop RPM. Use this file to update the Hadoop cluster.

Step 2. Update the Informatica Domain

Update the Informatica domain to enable MapR 4.0.1 or 4.0.2.
Note: If the Data Integration Service runs on a machine that uses SUSE 11, the native mode of execution and Hive pushdown are not supported. Use a Data Integration Service that runs on a machine that uses RHEL.
Perform the following tasks:
1. Apply the EBF to the Informatica domain. (A command sketch follows this list.)
2. Configure MapR distribution variables for mappings in a Hive environment.
3. Configure hive-site.xml.
4. To enable support for MapR 4.0.1, configure the hadoopenv.properties file.
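The following command sequence is a minimal sketch of task 1, applying the EBF to a single node, which the next section describes in detail. It assumes the installer tar was copied to /tmp/ebf15328 and that Informatica is installed in /opt/Informatica/9.6.1; both paths are placeholders for your environment.

    cd /tmp/ebf15328
    tar -xvf EBF15328_Server_installer_linux_em64t.tar
    # Point the installer at the Informatica installation and request an install, not a rollback.
    sed -i 's|^DEST_DIR=.*|DEST_DIR=/opt/Informatica/9.6.1|; s|^ROLLBACK=.*|ROLLBACK=0|' Input.properties
    sh installebf.sh
    # To roll back later, set ROLLBACK=1 in Input.properties and run installebf.sh again.

Repeat the same sequence on every node that is used for Hive pushdown.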

Applying EBF15328 to the Informatica Domain

Apply the EBF to every node in the domain that is used to connect to HDFS or HiveServer on MapR 4.0.1 or 4.0.2. To apply the EBF to a node in the domain, perform the following steps:
1. Copy EBF15328_Server_installer_linux_em64t.tar to a temporary location on the node.
2. Extract the installer file. Run the following command: tar -xvf EBF15328_Server_installer_linux_em64t.tar
3. Configure the following properties in the Input.properties file:
   DEST_DIR=<Informatica installation directory>
   ROLLBACK=0
4. Run installebf.sh.
5. Repeat steps 1 through 4 for every node in the domain that is used for Hive pushdown.
Note: To roll back the EBF for the Informatica domain on a node, set ROLLBACK to 1 and run installebf.sh.

Configuring MapR Distribution Variables for Mappings in a Hive Environment

When you use the MapR distribution to run mappings in a Hive environment, you must configure MapR environment variables. Configure the following MapR variables:
- Add MAPR_HOME to the environment variables in the Data Integration Service Process properties. Set MAPR_HOME to the following path: <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_classic.
- Add -Dmapr.library.flatclass to the custom properties in the Data Integration Service Process properties. For example, add JVMOption1=-Dmapr.library.flatclass.
- Add -Dmapr.library.flatclass to the Data Integration Service advanced property JVM Command Line Options.
- Set the MapR Container Location Database (CLDB) name in the following file: <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_classic/conf/mapr-clusters.conf. For example, add the following entry: INFAMAPR401 secure=false <master_node_name>:7222

Configuring hive-site.xml

You must configure the cluster properties in hive-site.xml on the machine on which the Data Integration Service runs. hive-site.xml is located in the following directory on the machine on which the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/<Hadoop distribution name>/conf/.
In hive-site.xml, configure the following properties:
hive.metastore.execute.setugi
Enables the Hive metastore server to use the client's user and group permissions. Set the value to true.
The following sample code shows the property you can configure in hive-site.xml:
<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>

hive.cache.expr.evaluation
Whether Hive enables the optimization to convert a common join into a mapjoin based on the input file size. The value must be set to false due to a bug in the optimization feature in this Hive version. For more information, see the corresponding Apache Hive JIRA entry.
The following sample code shows the property you can configure in hive-site.xml:
<property>
  <name>hive.cache.expr.evaluation</name>
  <value>false</value>
  <description>Whether Hive enables the optimization to convert a common join into a mapjoin based on the input file size.</description>
</property>

Configuring the hadoopenv.properties File for MapR 4.0.1

To enable MapR 4.0.1, you must configure the values in the hadoopenv.properties file on each node that runs a Data Integration Service used for Hadoop pushdown.
Note: If you are updating the Informatica domain to enable MapR 4.0.2, skip this task.
To configure the hadoopenv.properties file, perform the following tasks:
1. Navigate to the following directory: <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_classic/infaConf/.
2. Edit the hadoopenv.properties file.
3. Delete the following paths from infapdo.env.entry.mapred_classpath:
   $HADOOP_NODE_HADOOP_DIST/lib/htrace-core-2.04.jar
   $HADOOP_NODE_HADOOP_DIST/lib/hbase-server-<version>-mapr-1501.jar
   $HADOOP_NODE_HADOOP_DIST/lib/protobuf-java-<version>.jar
   $HADOOP_NODE_HADOOP_DIST/lib/hbase-client-<version>-mapr-1501.jar
   $HADOOP_NODE_HADOOP_DIST/lib/hbase-common-<version>-mapr-1501.jar
   $HADOOP_NODE_HADOOP_DIST/lib/hive-hbase-handler-<version>-mapr-1501.jar
   $HADOOP_NODE_HADOOP_DIST/lib/hbase-protocol-<version>-mapr-1501.jar
4. Add the following path to infapdo.env.entry.mapred_classpath:
   /opt/mapr/hadoop/hadoop-<version>/lib/*:/opt/mapr/hive/hive-0.13/lib/*
   Note: The /opt/mapr/hadoop path is the <Hadoop_HOME> on the cluster.
5. Find the following path in infapdo.env.entry.ld_library_path: $HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64
6. Replace the path with the following path: /opt/mapr/hadoop/hadoop-2.4.1/lib/native/*
7. Find the following path in -Djava.library.path: $HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64
8. Replace the path with the following path: /opt/mapr/hadoop/hadoop-2.4.1/lib/native/*:/opt/mapr/hadoop/hadoop-<version>/lib/*
9. Repeat steps 1 through 8 for each node that runs a Data Integration Service used for Hadoop pushdown.
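A quick way to sanity-check the classpath edits is shown below. This is a hedged sketch: it assumes $INFA_HOME points to the Informatica installation and that the classpath entry fits on a single line in the file.

    # Sanity-check hadoopenv.properties after the edits in this section.
    cd "$INFA_HOME/services/shared/hadoop/mapr_4.0.2_classic/infaConf"
    # The deleted HBase and htrace jars should no longer appear in the MapReduce classpath entry:
    grep 'infapdo.env.entry.mapred_classpath' hadoopenv.properties | grep -c 'hbase-'     # expect 0
    # The MapR cluster wildcard paths should now be present:
    grep 'infapdo.env.entry.mapred_classpath' hadoopenv.properties | grep -c '/opt/mapr/' # expect 1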

Step 3. Update the Hadoop Cluster

To update the Hadoop cluster to enable MapR 4.0.1 or 4.0.2, perform the following tasks:
1. Apply the EBF to the Hadoop cluster.
2. Configure the hive-site.xml file.
3. Configure the heap space for the MapR-FS.
4. To enable support for MapR 4.0.1, configure the hadoopres.properties file.
5. Verify the cluster details.

Applying EBF15328 to the Hadoop Cluster

To apply the EBF to the Hadoop cluster, perform the following steps:
1. Copy EBF15328_HadoopRPM_EBFInstaller.tar to a temporary location on the cluster machine.
2. Extract the installer file. Run the following command: tar -xvf EBF15328_HadoopRPM_EBFInstaller.tar
3. Provide the node list in the HadoopDataNodes file.
4. Configure the destdir parameter in the input.properties file: destdir=<Informatica home directory>
   For example, set the destdir parameter to the following value: destdir="/opt/Informatica"
5. Run InformaticaHadoopEBFInstall.sh.

Configuring a Hive Metastore in hive-site.xml

You must configure the Hive metastore property in hive-site.xml that grants the Data Integration Service permissions to perform operations on the Hive metastore. You must configure the property in hive-site.xml on the Hadoop cluster nodes. hive-site.xml is located in the following directory on every Hadoop cluster node: <Hadoop_NODE_INFA_HOME>/services/shared/hadoop/mapr_4.0.2_classic.
In hive-site.xml, configure the following property:
hive.metastore.execute.setugi
Enables the Hive metastore server to use the client's user and group permissions. Set the value to true.
The following sample code shows the property you can configure in hive-site.xml:
<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>

Configuring the Heap Space for the MapR-FS

You must configure the heap space reserved for the MapR-FS in the warden.conf file on every node in the cluster. Perform the following tasks:
1. Navigate to the following directory: /opt/mapr/conf.

2. Edit the warden.conf file.
3. Set the value of the service.command.mfs.heapsize.percent property to the percentage of heap space to reserve for the MapR-FS.
4. Save and close the file.
5. Repeat steps 1 through 4 for every node in the cluster.
6. Restart the cluster.

Configuring the hadoopres.properties File for MapR 4.0.1

To enable MapR 4.0.1, you must configure the classpath variable in the hadoopres.properties file on every node in the Hadoop cluster that is used for Hive pushdown.
Note: If you are updating the Hadoop cluster to enable MapR 4.0.2, skip this task.
1. Navigate to the following directory on a node in the Hadoop cluster: <HADOOP_NODE_INFA_HOME>/services/shared/hadoop/mapr_4.0.2_classic/infaConf.
2. Edit the hadoopres.properties file.
3. Find the infahdp.hadoop.classpath variable.
4. Delete the following path from the variable: $HADOOP_DIST/lib/*&:
5. Save and close the file.
6. Repeat steps 1 through 5 for every node in the Hadoop cluster.

Verifying the Cluster Details

Verify the following settings for the MapR cluster:

MapReduce Version
If the cluster is configured for YARN, use the MapR Control System (MCS) to change the configuration to MRv1 Classic. Then, restart the cluster.

MapR User Details
Verify that the MapR user exists on each Hadoop cluster node and that the following properties match:
- User ID (uid)
- Group ID (gid)
- Groups
For example, the MapR user might have the following properties:
uid=2000(mapr) gid=2000(mapr) groups=2000(mapr)

Data Integration Service User Details
Verify that the user who runs the Data Integration Service is assigned the same gid as the MapR user and belongs to the same group. For example, a Data Integration Service user named testuser might have the following properties:
uid=30103(testuser) gid=2000(mapr)

groups=2000(mapr)
After you verify the Data Integration Service user details, perform the following steps (the sketch after the next procedure shows shell commands for several of these checks):
1. Create a user that has the same user ID and name as the Data Integration Service user.
2. Add this user to all the nodes in the Hadoop cluster and assign it to the mapr group.
3. Verify that the user you created has read and write permissions for the following directory: /opt/mapr/hive/hive-0.13/logs. A directory corresponding to the user will be created at this location.
4. Verify that the user you created has permissions for the Hive warehouse directory. The Hive warehouse directory is set in the following file: /opt/mapr/hive/hive-0.13/conf/hive-site.xml. For example, if the warehouse directory is /user/hive/warehouse, run the following command to grant the user permissions for the directory: hadoop fs -chmod -R 777 /user/hive/warehouse
5. Verify that the user you created has permission to create staging tables. The default MapR-FS staging directory is /var/mapr/cluster/mapred/jobtracker/staging.
6. Verify that the directory specified in the mapred.local.dir property in the /opt/mapr/hadoop/hadoop-<version>/conf directory exists in the MapR-FS. The default value is /tmp/mapr-hadoop/mapred/local. If the directory does not exist, create it and give full permissions to the MapR user and the user you created.

Step 4. Update the Developer Tool

Update the Informatica clients to enable MapR 4.0.1 or 4.0.2. Perform the following tasks:
1. Apply the EBF to the Informatica clients.
2. Configure the Developer tool.

Applying EBF15328 to the Informatica Client

To apply the EBF to the Informatica client, perform the following steps:
1. Copy EBF15328_Client_Installer_win32_x86.zip to the Windows client machine.
2. Extract the installer.
3. Configure the following properties in the Input.properties file:
   DEST_DIR=<Informatica installation directory>
   ROLLBACK=0
   Use two backslashes as the path separator when you set the DEST_DIR property. For example, include the following lines in the Input.properties file:
   DEST_DIR=C:\\Informatica\\9.6.1HF2RC
   ROLLBACK=0
4. Run installebf.bat.
Note: To roll back the EBF for the Informatica client, set ROLLBACK to 1 and run installebf.bat.
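The following shell sketch, referenced from the cluster verification steps above, shows one way to run those checks on a cluster node. It is hedged: testuser is the example account from the previous section, and the directories are the defaults named above.

    # Run on a Hadoop cluster node.
    id mapr && id testuser                          # uid, gid, and groups must match as described above
    ls -ld /opt/mapr/hive/hive-0.13/logs            # the user needs read and write permission here
    hadoop fs -chmod -R 777 /user/hive/warehouse    # grant permissions for the Hive warehouse directory
    hadoop fs -ls /var/mapr/cluster/mapred/jobtracker/staging   # default MapR-FS staging directory
    hadoop fs -ls /tmp/mapr-hadoop/mapred/local     # default mapred.local.dir; create it if missing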

Configuring the Developer Tool

To configure the Developer tool after you apply the EBF, perform the following steps:
1. Go to the following directory on any node in the Hadoop cluster: <MapR installation directory>/conf.
2. Find the mapr-clusters.conf file.
3. Copy the file to the following directory on the machine on which the Developer tool runs: <Informatica installation directory>\clients\DeveloperClient\hadoop\mapr_client_4.0.1_beta\conf
4. Go to the following directory on the machine on which the Developer tool runs: <Informatica installation directory>\<version>\clients\DeveloperClient
5. Edit run.bat to set the MAPR_HOME environment variable and add the -clean setting. For example, include the following lines:
   set MAPR_HOME=<Informatica installation directory>\clients\DeveloperClient\hadoop\mapr_client_4.0.1_beta
   developerCore.exe -clean
6. Add the following values to the developerCore.ini file:
   -Dmapr.library.flatclass
   -Djava.library.path=hadoop\mapr_client_4.0.1_beta\lib\native\Win32;bin;..\DT\bin
   You can find developerCore.ini in the following directory: <Informatica installation directory>\clients\DeveloperClient
7. Use run.bat to start the Developer tool.

Step 5. Update PowerCenter

Update the Informatica domain to enable MapR 4.0.1 or 4.0.2. Perform the following tasks:
1. Update the repository plugin. (A command sketch follows this section.)
2. Configure the PowerCenter Integration Service.
3. Copy MapR distribution files for PowerCenter mappings in the native environment.

Updating the Repository Plugin

To enable PowerExchange for HDFS to run on the Hadoop distribution, update the repository plugin. Perform the following steps:
1. Ensure that the Repository Service is running in exclusive mode.
2. On the server machine, open the command console.
3. Run cd <Informatica installation directory>/server/bin.
4. Run ./pmrep connect -r <repo_name> -d <domain_name> -n <username> -x <password>.
5. Run ./pmrep registerplugin -i native/pmhdfs.xml -e -N true.
6. Set the Repository Service to normal mode.
7. Open the PowerCenter Workflow Manager on the client machine. The distribution appears in the Connection Object menu.
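The plugin registration steps above can be scripted as follows. This is a sketch that assumes a repository named Repo_BDE in a domain named Domain_Main; substitute your own names and credentials, and remember to switch the Repository Service back to normal mode afterward.

    cd /opt/Informatica/9.6.1/server/bin
    ./pmrep connect -r Repo_BDE -d Domain_Main -n Administrator -x '<password>'
    # -i names the plugin definition, -e updates an existing registration, -N true marks it as native.
    ./pmrep registerplugin -i native/pmhdfs.xml -e -N true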

Configuring the PowerCenter Integration Service

To enable support for MapR 4.0.1 or 4.0.2, configure the PowerCenter Integration Service. Perform the following steps:
1. Log in to the Administrator tool.
2. In the Domain Navigator, select the PowerCenter Integration Service.
3. Click the Processes view.
4. Add the following environment variable: MAPR_HOME. Use the following value: <INFA_HOME>/server/bin/javalib/hadoop/mapr402
5. Add the following custom property: JVMClassPath. Use the following value: <INFA_HOME>/server/bin/javalib/hadoop/mapr402/*:<INFA_HOME>/server/bin/javalib/hadoop/*

Copying MapR Distribution Files for PowerCenter Mappings in the Native Environment

When you use the MapR distribution to run mappings in a native environment, you must copy MapR files to the machine on which you install Big Data Edition.
1. Go to the following directory on any node in the cluster: <MapR installation directory>/conf. For example, go to the following directory on any node in the cluster: /opt/mapr/conf.
2. Find the following files:
   - mapr-clusters.conf
   - mapr.login.conf
3. Copy the files to the following directory on the machine on which the Data Integration Service runs: <Informatica installation directory>/server/bin/javalib/hadoop/mapr402/conf.
4. Log in to the Administrator tool.
5. In the Domain Navigator, select the PowerCenter Integration Service.
6. Recycle the service. Click Actions > Recycle Service.

Enable User Impersonation for Native and Hive Execution Environments

User impersonation allows the Data Integration Service to submit Hadoop jobs as a specific user. By default, Hadoop jobs are submitted with the user who runs the Data Integration Service.
To enable user impersonation for the native and Hive environments, perform the following steps:
1. Go to the following directory on the machine on which the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_classic/conf
2. Create a directory named "proxy".

   Run the following command: mkdir <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_classic/conf/proxy
3. Change the permissions for the proxy directory to -rwxr-xr-x. Run the following command: chmod 755 <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_classic/conf/proxy
4. Verify the following details for the user that you want to impersonate with the Data Integration Service user:
   - Exists on the machine on which the Data Integration Service runs.
   - Exists on every node in the Hadoop cluster.
   - Has the same user ID and group ID on the machine on which the Data Integration Service runs as well as on the Hadoop cluster.
5. Create a file for the Data Integration Service user that impersonates other users. Run the following command: touch <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_classic/conf/proxy/<username>
   For example, to create a file for the Data Integration Service user named user1 that is used to impersonate other users, run the following command: touch $INFA_HOME/services/shared/hadoop/mapr_4.0.2_classic/conf/proxy/user1
6. Log in to the Administrator tool.
7. In the Domain Navigator, select the Data Integration Service.
8. Recycle the Data Integration Service. Click Actions > Recycle Service.
A consolidated command sketch of the file system steps appears after the next section.

Enable Hive Pushdown for HBase

EBF15328 supports Hive pushdown for HBase. To enable Hive pushdown for HBase, perform the following steps:
1. Add hbase-protocol-<version>-mapr-1501.jar to the Hadoop classpath on every node of the Hadoop cluster.
2. Restart the Node Manager for each node.
3. Go to the following directory on the machine on which the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_classic/infaConf.
4. Edit the hadoopenv.properties file.
5. Add the following paths to the infapdo.aux.jars.path variable: file://$DIS_HADOOP_DIST/lib/hbase-client-<version>-mapr-1501.jar,file://$DIS_HADOOP_DIST/lib/hbase-common-<version>-mapr-1501.jar,file://$DIS_HADOOP_DIST/lib/htrace-core-2.04.jar,file://$DIS_HADOOP_DIST/lib/protobuf-java-<version>.jar,file://$DIS_HADOOP_DIST/lib/hbase-server-<version>-mapr-1501.jar
6. Log in to the Administrator tool.
7. In the Domain Navigator, select the Data Integration Service.
8. Recycle the Data Integration Service. Click Actions > Recycle Service.
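The file system commands in the impersonation steps above combine into the following sketch. It assumes $INFA_HOME points to the Informatica installation and uses the example user1 account; both are placeholders.

    # Create the proxy directory and a proxy file for the impersonating Data Integration Service user.
    cd "$INFA_HOME/services/shared/hadoop/mapr_4.0.2_classic/conf"
    mkdir proxy
    chmod 755 proxy       # results in -rwxr-xr-x
    touch proxy/user1
    # Confirm that user1 has the same uid and gid here and on every Hadoop cluster node:
    id user1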

Connections Overview

Define the connections you want to use to access data in Hive or HDFS. You can create the following types of connections:
- HDFS connection. Create an HDFS connection to read data from or write data to the Hadoop cluster.
- HBase connection. Create an HBase connection to access HBase. The HBase connection is a NoSQL connection.
- Hive connection. Create a Hive connection to access Hive data or run Informatica mappings in the Hadoop cluster. Create a Hive connection in the following connection modes:
  - Use the Hive connection to access Hive as a source or target. If you want to use Hive as a target, you need to have the same connection or another Hive connection that is enabled to run mappings in the Hadoop cluster. You can access Hive as a source if the mapping is enabled for the native or Hive environment. You can access Hive as a target only if the mapping is run in the Hadoop cluster.
  - Use the Hive connection to validate or run an Informatica mapping in the Hadoop cluster. Before you run mappings in the Hadoop cluster, review the information in this guide about rules and guidelines for mappings that you can run in the Hadoop cluster.
You can create the connections using the Developer tool, Administrator tool, and infacmd. (A hedged infacmd sketch follows the HDFS property descriptions below.)
Note: For information about creating connections to other sources or targets such as social media web sites or Teradata, see the respective PowerExchange adapter user guide.

HDFS Connection Properties

Use a Hadoop File System (HDFS) connection to access data in the Hadoop cluster. The HDFS connection is a file system type connection. You can create and manage an HDFS connection in the Administrator tool, Analyst tool, or the Developer tool. HDFS connection properties are case sensitive unless otherwise noted.
Note: The order of the connection properties might vary depending on the tool where you view them.
The following table describes HDFS connection properties:

Name — Name of the connection. The name is not case sensitive and must be unique within the domain. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] \ : ; " ' < , > . ? /
ID — String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description — The description of the connection. The description cannot exceed 765 characters.
Location — The domain where you want to create the connection. Not valid for the Analyst tool.
Type — The connection type. Default is Hadoop File System.
User Name — User name to access HDFS.
NameNode URI — The URI to access MapR-FS. Use the following URI: maprfs:///
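The infacmd route mentioned above can be scripted. The following is a hedged sketch, not a verified command line: the connection type token HadoopFileSystem and the option names NameNodeURL and USERNAME are assumptions to confirm with infacmd isp CreateConnection -h, and the domain, user, and connection names are placeholders.

    # infacmd.sh ships with the Informatica server installation; its directory varies by version.
    ./infacmd.sh isp CreateConnection -dn Domain_Main -un Administrator -pd '<password>' \
        -cn MapR_HDFS -cid MapR_HDFS -ct HadoopFileSystem \
        -o "NameNodeURL='maprfs:///' USERNAME='mapr'"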

HBase Connection Properties

Use an HBase connection to access HBase. The HBase connection is a NoSQL connection. You can create and manage an HBase connection in the Administrator tool or the Developer tool. HBase connection properties are case sensitive unless otherwise noted.
The following table describes HBase connection properties:

Name — The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] \ : ; " ' < , > . ? /
ID — String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description — The description of the connection. The description cannot exceed 4,000 characters.
Location — The domain where you want to create the connection.
Type — The connection type. Select HBase.
ZooKeeper Host(s) — Name of the machine that hosts the ZooKeeper server.
ZooKeeper Port — Port number of the machine that hosts the ZooKeeper server. Use the value specified for hbase.zookeeper.property.clientPort in hbase-site.xml. You can find hbase-site.xml on the NameNode machine in the following directory: /opt/mapr/hbase/hbase-<version>/conf
Enable Kerberos Connection — Enables the Informatica domain to communicate with the HBase master server or region server that uses Kerberos authentication.

HBase Master Principal — Service Principal Name (SPN) of the HBase master server. Enables the ZooKeeper server to communicate with an HBase master server that uses Kerberos authentication. Enter a string in the following format: hbase/<domain.name>@<YOUR-REALM>
Where:
- domain.name is the domain name of the machine that hosts the HBase master server.
- YOUR-REALM is the Kerberos realm.
HBase Region Server Principal — Service Principal Name (SPN) of the HBase region server. Enables the ZooKeeper server to communicate with an HBase region server that uses Kerberos authentication. Enter a string in the following format: hbase_rs/<domain.name>@<YOUR-REALM>
Where:
- domain.name is the domain name of the machine that hosts the HBase region server.
- YOUR-REALM is the Kerberos realm.

Hive Connection Properties

Use the Hive connection to access Hive data. A Hive connection is a database type connection. You can create and manage a Hive connection in the Administrator tool, Analyst tool, or the Developer tool. Hive connection properties are case sensitive unless otherwise noted.
Note: The order of the connection properties might vary depending on the tool where you view them.
The following table describes Hive connection properties:

Name — The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] \ : ; " ' < , > . ? /
ID — String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description — The description of the connection. The description cannot exceed 4,000 characters.
Location — The domain where you want to create the connection. Not valid for the Analyst tool.
Type — The connection type. Select Hive.

Connection Modes — Hive connection mode. Select at least one of the following options:
- Access Hive as a source or target. Select this option if you want to use the connection to access the Hive data warehouse. If you want to use Hive as a target, you must enable the same connection or another Hive connection to run mappings in the Hadoop cluster.
- Use Hive to run mappings in Hadoop cluster. Select this option if you want to use the connection to run mappings in the Hadoop cluster.
You can select both options. Default is Access Hive as a source or target.
User Name — User name of the user that the Data Integration Service impersonates to run mappings on a Hadoop cluster. Use the user name of an operating system user that is present on all nodes on the Hadoop cluster.

Common Attributes to Both the Modes:
Environment SQL — SQL commands to set the Hadoop environment. In the native environment type, the Data Integration Service executes the environment SQL each time it creates a connection to a Hive metastore. If you use the Hive connection to run mappings in the Hadoop cluster, the Data Integration Service executes the environment SQL at the beginning of each Hive session.
The following rules and guidelines apply to the usage of environment SQL in both connection modes:
- Use the environment SQL to specify Hive queries.
- Use the environment SQL to set the classpath for Hive user-defined functions and then use environment SQL or PreSQL to specify the Hive user-defined functions. You cannot use PreSQL in the data object properties to specify the classpath. The path must be the fully qualified path to the JAR files used for user-defined functions. Set the parameter hive.aux.jars.path with all the entries in infapdo.aux.jars.path and the path to the JAR files for user-defined functions. (A hedged example follows this table.)
- You can use environment SQL to define Hadoop or Hive parameters that you want to use in the PreSQL commands or in custom queries.
If you use the Hive connection to run mappings in the Hadoop cluster, the Data Integration Service executes only the environment SQL of the Hive connection. If the Hive sources and targets are on different clusters, the Data Integration Service does not execute the different environment SQL commands for the connections of the Hive source or target.
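The following is a hedged example of environment SQL that registers a Hive user-defined function, per the guidelines above. The JAR path, function name, and class name are placeholders, and in practice hive.aux.jars.path must also carry all the entries from infapdo.aux.jars.path.

    set hive.aux.jars.path=file:///opt/udf/my_udfs.jar;
    ADD JAR /opt/udf/my_udfs.jar;
    CREATE TEMPORARY FUNCTION normalize_ssn AS 'com.example.udf.NormalizeSSN';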

Properties to Access Hive as Source or Target

The following table describes the connection properties that you configure to access Hive as a source or target:

Metadata Connection String — The JDBC connection URI used to access the metadata from the Hadoop server. You can use PowerExchange for Hive to communicate with a HiveServer service or HiveServer2 service. To connect to HiveServer2, specify the connection string in the following format: jdbc:hive2://<hostname>:<port>/<db>
Where:
- <hostname> is the name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database name to which you want to connect. If you do not provide the database name, the Data Integration Service uses the default database details.
Bypass Hive JDBC Server — JDBC driver mode. Select the check box to use the embedded JDBC driver mode. To use the JDBC embedded mode, perform the following tasks:
- Verify that the Hive client and Informatica services are installed on the same machine.
- Configure the Hive connection properties to run mappings in the Hadoop cluster.
If you choose the non-embedded mode, you must configure the Data Access Connection String. Informatica recommends that you use the JDBC embedded mode.
Data Access Connection String — The connection string to access data from the Hadoop data store. To connect to HiveServer2, specify the non-embedded JDBC mode connection string in the following format: jdbc:hive2://<hostname>:<port>/<db>
Where:
- <hostname> is the name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database to which you want to connect. If you do not provide the database name, the Data Integration Service uses the default database details.

Properties to Run Mappings in Hadoop Cluster

The following table describes the Hive connection properties that you configure when you want to use the Hive connection to run Informatica mappings in the Hadoop cluster:

Database Name — Namespace for tables. Use the name default for tables that do not have a specified database name.
Default FS URI — The URI to access the default MapR File System. Use the following connection URI: maprfs:///

Yarn Resource Manager URI — The service within Hadoop that submits the MapReduce tasks to specific nodes in the cluster. For MapR with YARN, use the following format: <hostname>:<port>
Where:
- <hostname> is the host name or IP address of the JobTracker or YARN resource manager.
- <port> is the port on which the JobTracker or YARN resource manager listens for remote procedure calls (RPC).
Use the value specified by yarn.resourcemanager.address in yarn-site.xml. You can find yarn-site.xml in the following directory on the NameNode: /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop.
For MapR with MapReduce 1, use the following URI: maprfs:///
Hive Warehouse Directory on HDFS — The absolute HDFS file path of the default database for the warehouse that is local to the cluster. For example, the following file path specifies a local warehouse: /user/hive/warehouse
If the Metastore Execution Mode is remote, then the file path must match the file path specified by the Hive Metastore Service on the Hadoop cluster. Use the value specified for the hive.metastore.warehouse.dir property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
Advanced Hive/Hadoop Properties — Configures or overrides Hive or Hadoop cluster properties in hive-site.xml on the machine on which the Data Integration Service runs. You can specify multiple properties. Use the following format: <property1>=<value>
Where:
- <property1> is a Hive or Hadoop property in hive-site.xml.
- <value> is the value of the Hive or Hadoop property.
To specify multiple properties, use &: as the property separator. The maximum length for the format is 1 MB.
If you enter a required property for a Hive connection, it overrides the property that you configure in the Advanced Hive/Hadoop Properties. The Data Integration Service adds or sets these properties for each map-reduce job. You can verify these properties in the JobConf of each mapper and reducer job. Access the JobConf of each job from the JobTracker URL under each MapReduce job. The Data Integration Service writes messages for these properties to the Data Integration Service logs. The Data Integration Service must have the log tracing level set to log each row or have the log tracing level set to verbose initialization tracing.
For example, specify the following properties to control and limit the number of reducers to run a mapping job: mapred.reduce.tasks=2&:hive.exec.reducers.max=10
Temporary Table Compression Codec — Hadoop compression library for a compression codec class name.

Codec Class Name — Codec class name that enables data compression and improves performance on temporary staging tables.
Metastore Execution Mode — Controls whether to connect to a remote metastore or a local metastore. By default, local is selected. For a local metastore, you must specify the Metastore Database URI, Driver, Username, and Password. For a remote metastore, you must specify only the Remote Metastore URI.
Metastore Database URI — The JDBC connection URI used to access the data store in a local metastore setup. Use the following connection URI: jdbc:<data store type>://<node name>:<port>/<database name>
Where:
- <node name> is the host name or IP address of the data store.
- <data store type> is the type of the data store.
- <port> is the port on which the data store listens for remote procedure calls (RPC).
- <database name> is the name of the database.
For example, the following URI specifies a local metastore that uses MySQL as a data store: jdbc:mysql://hostname23:3306/metastore
Use the value specified for the javax.jdo.option.ConnectionURL property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
Metastore Database Driver — Driver class name for the JDBC data store. For example, the following class name specifies a MySQL driver: com.mysql.jdbc.Driver
Use the value specified for the javax.jdo.option.ConnectionDriverName property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
Metastore Database Username — The metastore database user name. Required if the Metastore Execution Mode is set to local. Use the value specified for the javax.jdo.option.ConnectionUserName property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
Metastore Database Password — The password for the metastore user name. Use the value specified for the javax.jdo.option.ConnectionPassword property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
Remote Metastore URI — The metastore URI used to access metadata in a remote metastore setup. For a remote metastore, you must specify the Thrift server details. Use the following connection URI: thrift://<hostname>:<port>
Where:
- <hostname> is the name or IP address of the Thrift metastore server.
- <port> is the port on which the Thrift server is listening.
Use the value specified for the hive.metastore.uris property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
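Pulled together, a local-metastore configuration in hive-site.xml might look like the following sketch. The host, port, database, and credentials are the placeholder values from the examples above; only the property names are fixed.

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://hostname23:3306/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>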

Creating a Connection

Create a connection before you import data objects, preview data, profile data, and run mappings.
1. Click Window > Preferences.
2. Select Informatica > Connections.
3. Expand the domain in the Available Connections list.
4. Select the type of connection that you want to create:
   - To select a Hive connection, select Database > Hive.
   - To select an HDFS connection, select File Systems > Hadoop File System.
5. Click Add.
6. Enter a connection name and optional description.
7. Click Next.
8. Configure the connection properties. For a Hive connection, you must choose the Hive connection mode and specify the commands for environment SQL. The SQL commands apply to both connection modes. Select at least one of the following connection modes:
   - Access Hive as a source or target. Use the connection to access Hive data. If you select this option and click Next, the Properties to Access Hive as a source or target page appears. Configure the connection strings.
   - Run mappings in a Hadoop cluster. Use the Hive connection to validate and run Informatica mappings in the Hadoop cluster. If you select this option and click Next, the Properties used to Run Mappings in the Hadoop Cluster page appears. Configure the properties.
9. Click Test Connection to verify the connection. You can test a Hive connection that is configured to access Hive data. You cannot test a Hive connection that is configured to run Informatica mappings in the Hadoop cluster.
10. Click Finish.

Known Limitations

The following known limitation applies:
- The nanoseconds portion of the timestamp column is corrupted when the following conditions are true:
  - The mapping contains a relational source that has a timestamp column.
  - The mapping contains a relational target that has a timestamp column.
  - The mapping is run in the Hive environment.
  Only three digits of the nanoseconds portion are supported.

Author
Big Data Edition Team

Simba XMLA Provider for Oracle OLAP 2.0. Linux Administration Guide. Simba Technologies Inc. April 23, 2013 Simba XMLA Provider for Oracle OLAP 2.0 April 23, 2013 Simba Technologies Inc. Copyright 2013 Simba Technologies Inc. All Rights Reserved. Information in this document is subject to change without notice.

More information

Kony MobileFabric. Sync Windows Installation Manual - WebSphere. On-Premises. Release 6.5. Document Relevance and Accuracy

Kony MobileFabric. Sync Windows Installation Manual - WebSphere. On-Premises. Release 6.5. Document Relevance and Accuracy Kony MobileFabric Sync Windows Installation Manual - WebSphere On-Premises Release 6.5 Document Relevance and Accuracy This document is considered relevant to the Release stated on this title page and

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

DESLock+ Basic Setup Guide Version 1.20, rev: June 9th 2014

DESLock+ Basic Setup Guide Version 1.20, rev: June 9th 2014 DESLock+ Basic Setup Guide Version 1.20, rev: June 9th 2014 Contents Overview... 2 System requirements:... 2 Before installing... 3 Download and installation... 3 Configure DESLock+ Enterprise Server...

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Installation Guide. . All right reserved. For more information about Specops Inventory and other Specops products, visit www.specopssoft.

Installation Guide. . All right reserved. For more information about Specops Inventory and other Specops products, visit www.specopssoft. . All right reserved. For more information about Specops Inventory and other Specops products, visit www.specopssoft.com Copyright and Trademarks Specops Inventory is a trademark owned by Specops Software.

More information

Centrify Server Suite 2015.1 For MapR 4.1 Hadoop With Multiple Clusters in Active Directory

Centrify Server Suite 2015.1 For MapR 4.1 Hadoop With Multiple Clusters in Active Directory Centrify Server Suite 2015.1 For MapR 4.1 Hadoop With Multiple Clusters in Active Directory v1.1 2015 CENTRIFY CORPORATION. ALL RIGHTS RESERVED. 1 Contents General Information 3 Centrify Server Suite for

More information

CYAN SECURE WEB HOWTO. NTLM Authentication

CYAN SECURE WEB HOWTO. NTLM Authentication CYAN SECURE WEB HOWTO June 2008 Applies to: CYAN Secure Web 1.4 and above NTLM helps to transparently synchronize user names and passwords of an Active Directory Domain and use them for authentication.

More information

WhatsUp Gold v16.2 Installation and Configuration Guide

WhatsUp Gold v16.2 Installation and Configuration Guide WhatsUp Gold v16.2 Installation and Configuration Guide Contents Installing and Configuring Ipswitch WhatsUp Gold v16.2 using WhatsUp Setup Installing WhatsUp Gold using WhatsUp Setup... 1 Security guidelines

More information

Installation Guide. Novell Storage Manager 3.1.1 for Active Directory. Novell Storage Manager 3.1.1 for Active Directory Installation Guide

Installation Guide. Novell Storage Manager 3.1.1 for Active Directory. Novell Storage Manager 3.1.1 for Active Directory Installation Guide Novell Storage Manager 3.1.1 for Active Directory Installation Guide www.novell.com/documentation Installation Guide Novell Storage Manager 3.1.1 for Active Directory October 17, 2013 Legal Notices Condrey

More information

Informatica Big Data Management (Version 10.1) Security Guide

Informatica Big Data Management (Version 10.1) Security Guide Informatica Big Data Management (Version 10.1) Security Guide Informatica Big Data Management Security Guide Version 10.1 June 2016 Copyright (c) 1993-2016 Informatica LLC. All rights reserved. This software

More information

Single Node Hadoop Cluster Setup

Single Node Hadoop Cluster Setup Single Node Hadoop Cluster Setup This document describes how to create Hadoop Single Node cluster in just 30 Minutes on Amazon EC2 cloud. You will learn following topics. Click Here to watch these steps

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Novell Access Manager

Novell Access Manager J2EE Agent Guide AUTHORIZED DOCUMENTATION Novell Access Manager 3.1 SP3 February 02, 2011 www.novell.com Novell Access Manager 3.1 SP3 J2EE Agent Guide Legal Notices Novell, Inc., makes no representations

More information

Install MS SQL Server 2012 Express Edition

Install MS SQL Server 2012 Express Edition Install MS SQL Server 2012 Express Edition Sohodox now works with SQL Server Express Edition. Earlier versions of Sohodox created and used a MS Access based database for storing indexing data and other

More information

HSearch Installation

HSearch Installation To configure HSearch you need to install Hadoop, Hbase, Zookeeper, HSearch and Tomcat. 1. Add the machines ip address in the /etc/hosts to access all the servers using name as shown below. 2. Allow all

More information

Only LDAP-synchronized users can access SAML SSO-enabled web applications. Local end users and applications users cannot access them.

Only LDAP-synchronized users can access SAML SSO-enabled web applications. Local end users and applications users cannot access them. This chapter provides information about the Security Assertion Markup Language (SAML) Single Sign-On feature, which allows administrative users to access certain Cisco Unified Communications Manager and

More information

EMC Documentum Connector for Microsoft SharePoint

EMC Documentum Connector for Microsoft SharePoint EMC Documentum Connector for Microsoft SharePoint Version 7.1 Installation Guide EMC Corporation Corporate Headquarters Hopkinton, MA 01748-9103 1-508-435-1000 www.emc.com Legal Notice Copyright 2013-2014

More information

Oracle Enterprise Manager. Description. Versions Supported

Oracle Enterprise Manager. Description. Versions Supported Oracle Enterprise Manager System Monitoring Plug-in Installation Guide for Microsoft SQL Server Release 10 (4.0.3.1.0) E14811-03 June 2009 This document provides a brief description about the Oracle System

More information

WebSphere Business Monitor V7.0 Configuring a remote CEI server

WebSphere Business Monitor V7.0 Configuring a remote CEI server Copyright IBM Corporation 2010 All rights reserved WebSphere Business Monitor V7.0 What this exercise is about... 2 Lab requirements... 2 What you should be able to do... 2 Introduction... 3 Part 1: Install

More information

Hadoop Basics with InfoSphere BigInsights

Hadoop Basics with InfoSphere BigInsights An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Part: 1 Exploring Hadoop Distributed File System An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

LAE 5.1. Windows Server Installation Guide. Version 1.0

LAE 5.1. Windows Server Installation Guide. Version 1.0 LAE 5.1 Windows Server Installation Guide Copyright THE CONTENTS OF THIS DOCUMENT ARE THE COPYRIGHT OF LIMITED. ALL RIGHTS RESERVED. THIS DOCUMENT OR PARTS THEREOF MAY NOT BE REPRODUCED IN ANY FORM WITHOUT

More information

Enhanced Connector Applications SupportPac VP01 for IBM WebSphere Business Events 3.0.0

Enhanced Connector Applications SupportPac VP01 for IBM WebSphere Business Events 3.0.0 Enhanced Connector Applications SupportPac VP01 for IBM WebSphere Business Events 3.0.0 Third edition (May 2012). Copyright International Business Machines Corporation 2012. US Government Users Restricted

More information

Oracle Enterprise Manager. Description. Versions Supported

Oracle Enterprise Manager. Description. Versions Supported Oracle Enterprise Manager System Monitoring Plug-in Installation Guide for Microsoft SQL Server Release 12 (4.1.3.2.0) E18740-01 November 2010 This document provides a brief description about the Oracle

More information

HADOOP CLUSTER SETUP GUIDE:

HADOOP CLUSTER SETUP GUIDE: HADOOP CLUSTER SETUP GUIDE: Passwordless SSH Sessions: Before we start our installation, we have to ensure that passwordless SSH Login is possible to any of the Linux machines of CS120. In order to do

More information

Kaseya Server Instal ation User Guide June 6, 2008

Kaseya Server Instal ation User Guide June 6, 2008 Kaseya Server Installation User Guide June 6, 2008 About Kaseya Kaseya is a global provider of IT automation software for IT Solution Providers and Public and Private Sector IT organizations. Kaseya's

More information

DESlock+ Basic Setup Guide ENTERPRISE SERVER ESSENTIAL/STANDARD/PRO

DESlock+ Basic Setup Guide ENTERPRISE SERVER ESSENTIAL/STANDARD/PRO DESlock+ Basic Setup Guide ENTERPRISE SERVER ESSENTIAL/STANDARD/PRO Contents Overview...1 System requirements...1 Enterprise Server:...1 Client PCs:...1 Section 1: Before installing...1 Section 2: Download

More information

Integrating VoltDB with Hadoop

Integrating VoltDB with Hadoop The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.

More information

SOA Software: Troubleshooting Guide for Agents

SOA Software: Troubleshooting Guide for Agents SOA Software: Troubleshooting Guide for Agents SOA Software Troubleshooting Guide for Agents 1.1 October, 2013 Copyright Copyright 2013 SOA Software, Inc. All rights reserved. Trademarks SOA Software,

More information

IBM Security QRadar SIEM Version 7.1.0 MR1. Log Sources User Guide

IBM Security QRadar SIEM Version 7.1.0 MR1. Log Sources User Guide IBM Security QRadar SIEM Version 7.1.0 MR1 Log Sources User Guide Note: Before using this information and the product that it supports, read the information in Notices and Trademarks on page 108. Copyright

More information

NetIQ Aegis Adapter for Databases

NetIQ Aegis Adapter for Databases Contents NetIQ Aegis Adapter for Databases Configuration Guide May 2011 Overview... 1 Product Requirements... 1 Implementation Overview... 1 Installing the Database Adapter... 2 Configuring a Database

More information

Hadoop Data Warehouse Manual

Hadoop Data Warehouse Manual Ruben Vervaeke & Jonas Lesy 1 Hadoop Data Warehouse Manual To start off, we d like to advise you to read the thesis written about this project before applying any changes to the setup! The thesis can be

More information

Moving the TRITON Reporting Databases

Moving the TRITON Reporting Databases Moving the TRITON Reporting Databases Topic 50530 Web, Data, and Email Security Versions 7.7.x, 7.8.x Updated 06-Nov-2013 If you need to move your Microsoft SQL Server database to a new location (directory,

More information

Preparing a SQL Server for EmpowerID installation

Preparing a SQL Server for EmpowerID installation Preparing a SQL Server for EmpowerID installation By: Jamis Eichenauer Last Updated: October 7, 2014 Contents Hardware preparation... 3 Software preparation... 3 SQL Server preparation... 4 Full-Text Search

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information

WhatsUp Gold v16.1 Installation and Configuration Guide

WhatsUp Gold v16.1 Installation and Configuration Guide WhatsUp Gold v16.1 Installation and Configuration Guide Contents Installing and Configuring Ipswitch WhatsUp Gold v16.1 using WhatsUp Setup Installing WhatsUp Gold using WhatsUp Setup... 1 Security guidelines

More information