Configuring Informatica Data Vault to Work with Cloudera Hadoop Cluster

© 2013 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.
Abstract

This document describes how to configure Informatica Data Vault to work with a Cloudera Hadoop cluster. Some of the Data Vault configuration described here may also apply to other Hadoop distributions; the Cloudera-specific configuration, however, is strictly for use with the Cloudera distribution of Hadoop. This document assumes that the Linux distribution is Red Hat Enterprise Linux 6. Installation of the Hadoop client might differ on other distributions.

Supported Versions

Informatica Data Vault (File Archive Service) 6.1.1

Table of Contents

Overview
Architecture
Install Hadoop Client
    Step 1. Create a Yum Repository
    Step 2. Install Hadoop Client Package Using Yum
Configure Hadoop Client
    Step 1. Modify core-site.xml
    Step 2. Test Hadoop Client Configuration
Configure Informatica Data Vault
Configure Environment
    Step 1. Modify .bash_profile
    Step 2. Load the Informatica Data Vault Environment
    Step 3. Start Informatica Data Vault Service
    Step 4. Push a Test sct File to the Cloudera Hadoop Cluster
    Step 5. Test a Query on Hadoop
Overview

The Cloudera Hadoop cluster is a high-performance, load-balanced cluster, and most customers prefer not to install additional software on any machine that is part of the cluster. This document describes how to configure a separate machine that hosts Informatica Data Vault to work with the Hadoop cluster.

Architecture

The machine that connects to the Hadoop cluster can host Informatica Data Archive, Informatica Data Vault, and the Cloudera Hadoop client. The recommended configuration for this machine is at least 4 cores and 32 GB of RAM. Informatica Data Vault communicates with the Hadoop cluster through the Cloudera Hadoop client. Open source versions of Hadoop are also available from Apache. However, the open source Hadoop version is often lower than the version running on the Cloudera Hadoop cluster, and there have been recurring problems configuring the open source software to work with the Cloudera distribution. The supported Cloudera Distribution of Hadoop is CDH 4.x.

Figure 1. Recommended Architecture

Install Hadoop Client

The recommended way to install the Cloudera Hadoop client is with the yum tool. Configuring and installing any package through yum requires superuser privileges on the Linux machine. The following steps install the Cloudera Hadoop client using yum.

Step 1. Create a Yum Repository

Create the Cloudera cdh4 repo file under /etc/yum.repos.d, for example with the following command:

# cat > /etc/yum.repos.d/cloudera-cdh4.repo <<EOF
[cloudera-cdh4]
name = Cloudera CDH, Version 4
baseurl = http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4/
gpgkey = http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
EOF

This allows the yum tool to download the Cloudera Hadoop client and all of its dependencies from the Cloudera repository.

Step 2. Install Hadoop Client Package Using Yum

Install the Cloudera Hadoop client using the following command:

# yum -y install hadoop-client

The process can take a while to complete, but at the end you should be able to check the Hadoop version using the following command:

# hadoop version

Configure Hadoop Client

Step 1. Modify core-site.xml

Hadoop's configuration files are installed under /etc/hadoop/conf. The core-site.xml file contains configuration that overrides the default values for core Hadoop properties. Modify core-site.xml to look like the following snippet:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<namenode_name_or_ipaddress>:<port></value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

The default port for the HDFS service on the Cloudera Hadoop cluster's NameNode is 8020. Make sure that this port is open in the firewall of the NameNode, and that the NameNode is configured to run on an IP address or hostname that is reachable from outside the host.

Step 2. Test Hadoop Client Configuration

To verify the Hadoop client configuration, run the following command as any user:

$ hadoop fs -ls /

If the command returns a list of the available directories in Hadoop, the Hadoop client configuration is successful.
If there is an error, verify that the Hadoop client version is not lower than the version on the Cloudera Hadoop cluster, and check that you can connect to the host and port specified in core-site.xml using the following command:

$ telnet <namenode_name_or_ipaddress> <port>
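The checks above can also be scripted. The sketch below pulls the NameNode host and port out of the fs.default.name value in core-site.xml and probes the port with bash's built-in /dev/tcp device, so telnet is not required. The config file is inlined here with a hypothetical host name so the script is self-contained; on a real machine, point CONF at /etc/hadoop/conf/core-site.xml instead.

```shell
#!/bin/bash
# Hypothetical sample config inlined for illustration; on a real host
# use: CONF=/etc/hadoop/conf/core-site.xml
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF

# Take the line after the fs.default.name property name and strip the
# <value> tags around the hdfs:// URI.
uri=$(grep -A1 'fs.default.name' "$CONF" | sed -n 's|.*<value>\(.*\)</value>.*|\1|p')
hostport=${uri#hdfs://}
host=${hostport%%:*}
port=${hostport##*:}
echo "NameNode host: $host, port: $port"

# Probe the port; a short timeout keeps the check fast when the host
# is unreachable.
if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port is reachable"
else
    echo "$host:$port is not reachable"
fi
rm -f "$CONF"
```

This is a troubleshooting aid only; a "not reachable" result points at the same firewall or NameNode binding issues described above.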
Telnet is considered obsolete and is not installed by default on recent Linux distributions, so you might need to install it with the following command as superuser:

# yum -y install telnet

Configure Informatica Data Vault

For a new installation of Informatica Data Vault, in the Advanced Configuration section, change the value of Maximum VMEM to 20480 (20 GB). For existing installations, modify this property in the ssa.ini Data Vault configuration file. On Linux, the number of agents that start automatically with the Data Vault Service is two. You need at least four agents so that the loader does not crash while loading files into the Hadoop cluster.

You also need to add a section to the ssa.ini configuration file that describes the Hadoop connection. The following snippet shows the sections that need to be added or edited:

[QUERY]
THREADS=2
MAXVMEM=20480
MEMORY=512
TEMPDIR=/home/hadoop/ILM-FAS/temp
SHAREDIR=/home/hadoop/ILM-FAS/temp

[STARTER]
AGENT_CONTROL=1
AGENT_COUNT=4
VERBOSE=2
SERVER_CONTROL=1
AGENT_CMD=ssaagent
SERVER_CMD=ssaserver
#EXE0=ssaservice start
LOGDIR=/home/hadoop/ILM-FAS/fas_logs

[HADOOP_CONNECTION cloudera]
URL = ilmaustin14
PORT = 8020

Configure Environment

Step 1. Modify .bash_profile

Add the following lines to your .bash_profile file so that Informatica Data Vault can load the libraries required to access the Hadoop cluster:

LD_LIBRARY_PATH=/usr/java/jdk1.7.0_21/jre/lib/amd64/server:/usr/lib64:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
CLASSPATH=/usr/lib/hadoop/hadoop-common.jar:/usr/lib/hadoop/hadoop-annotations.jar:/usr/lib/hadoop/hadoop-auth.jar:/usr/lib/hadoop/lib/commons-logging-1.1.1.jar:/usr/lib/hadoop/lib/commons-lang-2.5.jar:/usr/lib/hadoop/lib/commons-configuration-1.6.jar:/usr/lib/hadoop/lib/guava-11.0.2.jar:/usr/lib/hadoop/lib/slf4j-api-1.6.1.jar:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar:/usr/lib/hadoop/lib/log4j-1.2.17.jar:/usr/lib/hadoop-hdfs/hadoop-hdfs.jar:/usr/lib/hadoop/lib/commons-cli-1.2.jar:/usr/lib/hadoop/lib/protobuf-java-2.4.0a.jar:/usr/lib/hadoop/lib/commons-io-2.1.jar; export CLASSPATH

Step 2. Load the Informatica Data Vault Environment

Informatica Data Vault installs with a preconfigured script that loads all the environment variables required by the Informatica Data Vault components. The script file is located in the Informatica Data Vault installation directory. Source this script using the following command:

$ . ssaenv.sh

Step 3. Start Informatica Data Vault Service

There are several ways to start the Informatica Data Vault Server and its associated services. The recommended way is a single command that loads all the required services and starts the number of agents specified in the configuration:

$ ssa_starter -r &

Step 4. Push a Test sct File to the Cloudera Hadoop Cluster

Push a test sct file into the Cloudera Hadoop cluster by running the following command:

$ ssadrv -imp address_a.sct hdfs://cloudera/user

Step 5. Test a Query on Hadoop

Verify that you can query the sct file that is loaded into Hadoop by running the following command:

$ ssau -q hdfs://cloudera//user/address_a.sct

Authors

Seetharama Khandrika
Lead Software Developer

Acknowledgements

To construct this document, we used several references from Apache's web site and used the free Cloudera Hadoop distribution to determine all the dependencies. The jars listed in the CLASSPATH variable will change based on the Hadoop version.
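Because the jar versions change between Hadoop releases, one way to keep .bash_profile stable is to build CLASSPATH by globbing the client's jar directories instead of hard-coding versioned file names. The sketch below uses a throwaway directory with dummy jars so it can run anywhere; on a real host, set HADOOP_HOME=/usr/lib/hadoop.

```shell
#!/bin/bash
# Stand-in for the real client install; use HADOOP_HOME=/usr/lib/hadoop
# on a machine where hadoop-client is installed.
HADOOP_HOME=$(mktemp -d)
mkdir -p "$HADOOP_HOME/lib"
touch "$HADOOP_HOME/hadoop-common.jar" \
      "$HADOOP_HOME/lib/commons-logging-1.1.1.jar" \
      "$HADOOP_HOME/lib/guava-11.0.2.jar"

# Append every jar in the top-level and lib directories, separated by
# colons; ${CLASSPATH:+...} avoids a leading colon on the first entry.
CLASSPATH=""
for jar in "$HADOOP_HOME"/*.jar "$HADOOP_HOME"/lib/*.jar; do
    CLASSPATH="${CLASSPATH:+$CLASSPATH:}$jar"
done
export CLASSPATH
echo "$CLASSPATH"
rm -rf "$HADOOP_HOME"
```

This is illustrative only: globbing picks up every jar present, so verify that the generated CLASSPATH still covers the specific jars listed in Step 1 above before relying on it.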