Configuring Informatica Data Vault to Work with Cloudera Hadoop Cluster




(c) 2013 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

This document describes how to configure Informatica Data Vault to work with a Cloudera Hadoop cluster. Some of the Data Vault settings described here may also apply to other Hadoop distributions; the Cloudera-specific settings, however, are strictly for the Cloudera distribution of Hadoop. This document assumes Red Hat Enterprise Linux 6 as the Linux distribution. Installation of the Hadoop client might differ on other distributions.

Supported Versions

Informatica Data Vault (File Archive Service) 6.1.1

Table of Contents

Overview
Architecture
Install Hadoop Client
    Step 1. Create a Yum Repository
    Step 2. Install Hadoop Client Package Using Yum
Configure Hadoop Client
    Step 1. Modify core-site.xml
    Step 2. Test Hadoop Client Configuration
Configure Informatica Data Vault
Configure Environment
    Step 1. Modify .bash_profile
    Step 2. Load the Informatica Data Vault Environment
    Step 3. Start the Informatica Data Vault Service
    Step 4. Push a Test sct File to the Cloudera Hadoop Cluster
    Step 5. Test a Query on Hadoop

Overview

The Cloudera Hadoop cluster is a high-performance, load-balanced cluster, and most customers prefer not to install additional software on any machine that is part of the cluster. This document describes how to configure a separate machine that hosts Informatica Data Vault to work with the Hadoop cluster.

Architecture

The machine that connects to the Hadoop cluster can host Informatica Data Archive, Informatica Data Vault, and the Cloudera Hadoop client. The recommended configuration for this machine is at least 4 cores and 32 GB of RAM. Informatica Data Vault communicates with the Hadoop cluster through the Cloudera Hadoop client. Open source versions of Hadoop are also available from Apache. However, the open source Hadoop version is often lower than the version running on the Cloudera Hadoop cluster, and configuring the open source software to work with the Cloudera distribution has consistently been problematic. The supported Cloudera Distribution of Hadoop is CDH 4.x.

Figure 1. Recommended Architecture

Install Hadoop Client

The recommended way to install the Cloudera Hadoop client is with yum. Configuring and installing any package through yum requires superuser privileges on the Linux machine. The following steps install the Cloudera Hadoop client using yum.

Step 1. Create a Yum Repository

Create the Cloudera cdh4 repo file /etc/yum.repos.d/cloudera-cdh4.repo with the following contents:

[cloudera-cdh4]
name = Cloudera CDH, Version 4
baseurl = http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4/
gpgkey = http://archive.cloudera.com/redhat/cdh/rpm-gpg-key-cloudera

gpgcheck = 1

This allows yum to download the Cloudera Hadoop client and all of its dependencies from the Cloudera repository.

Step 2. Install Hadoop Client Package Using Yum

Install the Cloudera Hadoop client with the following command:

# yum -y install hadoop-client

The process can take a while to complete, but at the end you should be able to check the Hadoop version with the following command:

# hadoop version

Configure Hadoop Client

Step 1. Modify core-site.xml

The Hadoop configuration files are installed under /etc/hadoop/conf. The core-site.xml file contains configuration values that override the defaults for core Hadoop properties. Modify core-site.xml to look like the following snippet:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<namenode_name_or_ipaddress>:<port></value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

The default port for the Cloudera Hadoop cluster's HDFS service on the NameNode is 8020. Make sure that this port is open in the firewall of the Cloudera Hadoop cluster's NameNode, and that the NameNode is configured to run on an IP address or hostname that is reachable from outside its host.

Step 2. Test Hadoop Client Configuration

To verify the Hadoop client configuration, run the following command as any user:

$ hadoop fs -ls /

If the command returns a list of the available directories in Hadoop, the Hadoop client configuration is successful.
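Before running hadoop fs commands, it can help to confirm programmatically which NameNode endpoint the client will contact. The following is a minimal sketch, not part of any Informatica or Cloudera tool; it writes a throwaway sample file so it can run anywhere, but on a real machine you would pass /etc/hadoop/conf/core-site.xml, and the hostname shown is a placeholder:

```python
# Sketch: read fs.default.name from a core-site.xml and report the NameNode
# host and port the Hadoop client will contact. File and host names are
# illustrative; on a real box you would pass /etc/hadoop/conf/core-site.xml.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

def namenode_endpoint(core_site_path):
    """Return (host, port) parsed from the fs.default.name property."""
    root = ET.parse(core_site_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.default.name":
            uri = urlparse(prop.findtext("value"))
            return uri.hostname, uri.port
    raise KeyError("fs.default.name is not set in " + core_site_path)

# Self-contained demo: write a small sample config and parse it back.
with open("core-site-sample.xml", "w") as f:
    f.write("<configuration><property>"
            "<name>fs.default.name</name>"
            "<value>hdfs://namenode.example.com:8020</value>"
            "</property></configuration>")

print(namenode_endpoint("core-site-sample.xml"))  # → ('namenode.example.com', 8020)
```

If the reported host or port differs from what you expect, fix core-site.xml before troubleshooting the cluster side.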
If there is an error, verify that the Hadoop client version is not lower than the Cloudera Hadoop cluster's version, or check whether you can connect to the host and port specified in core-site.xml using the following command:

$ telnet <namenode_name_or_ipaddress> <port>
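Where telnet is not available, the same reachability check can be done with a short Python socket probe. This is a sketch; the host and port are placeholders for the NameNode values from core-site.xml:

```python
# Sketch: probe TCP connectivity to the NameNode, mimicking
# "telnet <host> <port>". The host and port below are placeholders.
import socket

def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the default HDFS NameNode port on a placeholder host.
# port_reachable("namenode.example.com", 8020)
```

A False result points at a firewall or a NameNode bound to an address that is not reachable from outside its host, as described above.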

Telnet is considered obsolete and is not installed by default on recent Linux distributions, so you might need to install it as superuser with the following command:

# yum -y install telnet

Configure Informatica Data Vault

For a new installation of Informatica Data Vault, in the Advanced Configuration section, change the value of Maximum VMEM to 20480 (indicating 20 GB). For an existing installation, modify this property in the ssa.ini Data Vault configuration file. On Linux, two agents start automatically with the Data Vault Service. You need at least four agents so that the loader does not crash while loading files into the Hadoop cluster. You also need to add a section to ssa.ini that describes the Hadoop connection. The following snippet shows the sections that need to be added or edited:

[QUERY]
THREADS=2
MAXVMEM=20480
MEMORY=512
TEMPDIR=/home/hadoop/ILM-FAS/temp
SHAREDIR=/home/hadoop/ILM-FAS/temp

[STARTER]
AGENT_CONTROL=1
AGENT_COUNT=4
VERBOSE=2
SERVER_CONTROL=1
AGENT_CMD=ssaagent
SERVER_CMD=ssaserver
#EXE0=ssaservice start
LOGDIR=/home/hadoop/ILM-FAS/fas_logs

[HADOOP_CONNECTION cloudera]
URL = ilmaustin14
PORT = 8020

Configure Environment

Step 1. Modify .bash_profile

Add the following lines to your .bash_profile file so that Informatica Data Vault can load the libraries required to access the Hadoop cluster:

LD_LIBRARY_PATH=/usr/java/jdk1.7.0_21/jre/lib/amd64/server:/usr/lib64:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
CLASSPATH=/usr/lib/hadoop/hadoop-common.jar:/usr/lib/hadoop/hadoop-annotations.jar:/usr/lib/hadoop/hadoop-auth.jar:/usr/lib/hadoop/lib/commons-logging-1.1.1.jar:/usr/lib/hadoop/lib/commons-lang-2.5.jar:/usr/lib/hadoop/lib/commons-configuration-1.6.jar:/usr/lib/hadoop/lib/guava-11.0.2.jar:/usr/lib/hadoop/lib/slf4j-api-1.6.1.jar:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar:/usr/lib/hadoop/lib/log4j-1.2.17.jar:/usr/lib/hadoop-hdfs/hadoop-hdfs.jar:/usr/lib/hadoop/lib/commons-cli-1.2.jar:/usr/lib/hadoop/lib/protobuf-java-2.4.0a.jar:/usr/lib/hadoop/lib/commons-io-2.1.jar; export CLASSPATH

Step 2. Load the Informatica Data Vault Environment

Informatica Data Vault installs with a pre-configured script that loads all the environment variables required by the Informatica Data Vault components. The script is located in the Informatica Data Vault installation directory. Source it with the following command:

$ . ssaenv.sh

Step 3. Start the Informatica Data Vault Service

There are several ways to start the Informatica Data Vault Server and its associated services. The recommended way is a single command that loads all the required services and starts the number of agents specified in the configuration:

$ ssa_starter -r &

Step 4. Push a Test sct File to the Cloudera Hadoop Cluster

Push a test sct file into the Cloudera Hadoop cluster by running the following command:

$ ssadrv -imp address_a.sct hdfs://cloudera/user

Step 5. Test a Query on Hadoop

Verify that you can query the sct file loaded into Hadoop by running the following command:

$ ssau -q hdfs://cloudera//user/address_a.sct

Authors

Seetharama Khandrika
Lead Software Developer

Acknowledgements

To construct this document, we used a few references from the Apache web site and used the free Cloudera Hadoop distribution to determine all the dependencies. The jars listed in the CLASSPATH variable will change based on the Hadoop version.
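Because the jar versions in the CLASSPATH change with the Hadoop release (as the acknowledgements note), it can be easier to assemble the variable from a glob than to hard-code each versioned jar. The following is a sketch, not an Informatica-supplied script; it demonstrates the idea against a throwaway directory, whereas on a real host you would pass /usr/lib/hadoop and /usr/lib/hadoop/lib:

```shell
# Sketch: build a colon-separated CLASSPATH from every jar found under the
# given directories, instead of listing versioned jars by hand.
build_classpath() {
    local dir jar cp=""
    for dir in "$@"; do
        for jar in "$dir"/*.jar; do
            [ -e "$jar" ] || continue   # skip directories with no jars
            cp="${cp:+$cp:}$jar"
        done
    done
    printf '%s\n' "$cp"
}

# Demo against a throwaway directory tree (real use: /usr/lib/hadoop etc.):
demo=$(mktemp -d)
mkdir -p "$demo/hadoop/lib"
touch "$demo/hadoop/hadoop-common.jar" "$demo/hadoop/lib/guava-11.0.2.jar"
CLASSPATH=$(build_classpath "$demo/hadoop" "$demo/hadoop/lib")
export CLASSPATH
echo "$CLASSPATH"
```

Rebuilding the variable this way after a Hadoop client upgrade avoids stale references to jars that no longer exist.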