Configuring Hadoop Distributed File System as an Optimized File Archive Store

2013 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.
Abstract

This article provides information on how to configure Hadoop Distributed File System (HDFS) as an optimized file archive store in Data Archive. Data Archive uses the libhdfs API to archive and access data in HDFS.

Supported Versions

Data Archive 6.1.x

Table of Contents

Overview
Step 1. Install the libhdfs API Files
Step 2. Create a Directory in HDFS
Step 3. Create the Target Connection
Step 4. Run the Create Archive Folder Job
Step 5. Copy the Hadoop Connection to Other File Archive Service Configuration Files
Step 6. Validate the Connection to HDFS

Overview

You can use the Hadoop Distributed File System (HDFS) as an optimized file archive store in Data Archive. To create an optimized file archive in HDFS, complete the following tasks:

1. Install the libhdfs API files.
2. Create a directory in HDFS.
3. Create an optimized file archive target connection.
4. Run the Create Archive Folder job.
5. Copy the connection to other File Archive Service configuration files.
6. Validate the connection to HDFS.

Step 1. Install the libhdfs API Files

The libhdfs API provides access to files in a Hadoop file system. Data Archive requires the libhdfs API files to access an optimized file archive in HDFS. The Hadoop installation includes the libhdfs API.

The File Archive Service requires the following libhdfs files:

- commons-logging-api-1.0.4.jar
- hadoop-0.20.2-core.jar
- libhdfs.so (UNIX) or libhdfs.dll (Windows)

To install the libhdfs API, copy the libhdfs files to the machines where the following File Archive Service components are installed:

File Archive Service
On Windows, copy the files to the root of the File Archive Service directory.
On UNIX, copy the files to <File Archive Service Directory>/odbc.

File Archive Service agent
On Windows or UNIX, copy the files to the root of the File Archive Service agent directory. If the File Archive Service agent is installed on multiple machines, copy the libhdfs API files to all machines that host a File Archive Service agent.

File Archive Service plug-in for Data Archive
On Windows, copy the files to <Data Archive Directory>\webapp\file_archive. On UNIX, copy the files to <Data Archive Directory>/webapp/file_archive/odbc.

After the installation, verify that the CLASSPATH environment variable includes the location of the libhdfs files.

Step 2. Create a Directory in HDFS

In HDFS, create a directory for the optimized file archive.

Step 3. Create the Target Connection

In Data Archive, create a target connection to the optimized file archive and set the archive store type to Hadoop HDFS. The following list describes the properties that you need to set for the target connection:

Staging Directory
Directory in which the file archive loader temporarily stores data as it completes the archive process. Enter the absolute path for the directory. The directory must be accessible to the ILM Server.

Number of Rows Per File
Maximum number of rows that the file archive loader stores in a file in the optimized file archive. Default is 1 million rows.

File Archive Data Directory
Directory in which the file archive loader creates the optimized file archive. Enter the absolute path for the directory. You can set up the directory on local storage or use Network File System (NFS) to connect to a directory on any of the following types of storage devices:

- Direct-attached storage (DAS)
- Network-attached storage (NAS)
- Storage area network (SAN)

You can specify a different directory for each optimized file archive target connection. The directory must be accessible to the ILM Server and the File Archive Service.
If you select an archive store in the Archive Store Type property, the file archive loader archives data to the archive store, not to the location specified in the File Archive Data Directory property. Instead, the file archive loader uses the file archive data directory as a staging location when it writes data to the archive store.

File Archive Folder Name
Name of the folder in the optimized file archive in which to store the archived data. The optimized file archive folder corresponds to the database in the archive source.
File Archive Host
Host name or IP address of the machine that hosts the File Archive Service.

File Archive Port
Port number used by the ssasql command line program and other clients, such as the SQL Worksheet and ODBC applications, to connect to the File Archive Service. Default is 8500.

File Archive Administration Port
Port number used by the File Archive Service agent and the File Archive Administrator tool to connect to the File Archive Service. Default is 8600.

File Archive User
Name of the administrator user account that connects to the File Archive Service. You can use the default administrator user account created during the File Archive Service installation. The user name for the default administrator user account is dba.

File Archive User Password
Password for the administrator user account.

Confirm Password
Verification of the password for the administrator user account.

Add-On URL
URL for the File Archive Service for External Attachments component. The File Archive Service for External Attachments converts external attachments from the archived format to the source format. This URL is required to restore encrypted attachments from the optimized file archive to the source database.

Maintain Imported Schema Name
Uses schema names from the source data imported through the Enterprise Data Manager. By default, this option is enabled. The file archive loader creates a schema structure in the optimized file archive folder that corresponds to the source schema structure imported through the Enterprise Data Manager. It adds the transactional tables to the schemas within the structure. The file archive loader also creates a dbo schema and adds the metadata tables to the dbo schema.

The imported schema structure is based on the data source. If source connections contain similar structures but use different schema names, you must import the source schema structure for each source connection. For example, you import the schema structure from a development instance.
You export metadata from the development instance and import the metadata into the production instance. If the schema names are different in development and production, you must import the schema structure from the production instance. You cannot use the schema structure imported from the development instance.

If this option is not enabled, the file archive loader creates the dbo schema in the file archive folder. The file archive loader adds all transactional tables for all schemas and all metadata tables to the dbo schema.

Archive Store Type
Storage platform for the optimized file archive. Select the Hadoop HDFS archive store.

HDFS URL
Host name or IP address of the HDFS server.

HDFS Port
Port number to connect to HDFS. The default HDFS port number is 54310.
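Before you save the connection, it can help to confirm that the HDFS URL and HDFS Port values point at a reachable host. The following Python sketch is illustrative only (it is not part of Data Archive, and the host address shown is the example value used later in this article); it simply opens a plain TCP connection to the configured host and port:

```python
import socket

def hdfs_port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    This only confirms that something is listening on the HDFS port;
    it does not authenticate or verify that the service is a NameNode.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical usage, with the example address from this article:
# hdfs_port_reachable("10.17.40.25", 54310)
```

A False result usually points to a wrong host name, a wrong port, or a firewall between the Data Archive machine and the HDFS server.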
Command
Path to the directory for the optimized file archive in HDFS. Do not include the HDFS prefix or host name.

Step 4. Run the Create Archive Folder Job

In Data Archive, run the Create Archive Folder job to create the file archive folder and the connection to HDFS.

The Create Archive Folder job creates the file archive folder and adds a Hadoop connection entry to the ssa.ini file in the File Archive Service plug-in in Data Archive. The job sets the name of the file archive folder and the name of the Hadoop connection based on the folder name property specified in the target connection. For example, if the File Archive Folder Name property in the target connection is set to HDFS_Sales, the Create Archive Folder job creates a file archive folder named HDFS_Sales and adds a Hadoop connection named HDFS_Sales to the ssa.ini file.

The following example shows an entry for a Hadoop connection named HDFS_Sales in the ssa.ini file:

[HADOOP_CONNECTION HDFS_Sales]
URL = 10.17.40.25
PORT = 54310

Step 5. Copy the Hadoop Connection to Other File Archive Service Configuration Files

The Hadoop connection definition on the machine that hosts Data Archive must match the Hadoop connection definition on the machines that host the other File Archive Service components. Copy the Hadoop connection definition from the ssa.ini file on the machine that hosts Data Archive to the ssa.ini files on the machines that host the File Archive Service and the File Archive Service agent.

After you run the Create Archive Folder job, go to the File Archive Service plug-in directory in Data Archive and find the ssa.ini file. Copy the Hadoop connection definition to the ssa.ini file on the machine that hosts the File Archive Service. Additionally, if you have installed the File Archive Service agent on another machine, copy the Hadoop connection definition to the ssa.ini file on the machine that hosts the File Archive Service agent.
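The copy itself can be scripted. The Python sketch below is an illustration, not an Informatica utility; it assumes the Hadoop connection entry uses the plain KEY = VALUE layout shown in Step 4, and the file paths passed to it are placeholders you would replace with the real ssa.ini locations:

```python
import configparser

def copy_hadoop_connection(src_ini: str, dst_ini: str, folder_name: str) -> None:
    """Copy the [HADOOP_CONNECTION <folder_name>] section from one ssa.ini
    file to another, if the destination does not already define it.

    Other sections in the destination file are preserved untouched.
    """
    section = f"HADOOP_CONNECTION {folder_name}"

    src = configparser.ConfigParser()
    src.optionxform = str          # keep key case (URL, PORT) as written
    src.read(src_ini)

    dst = configparser.ConfigParser()
    dst.optionxform = str
    dst.read(dst_ini)

    if not dst.has_section(section):
        dst.add_section(section)
        for key, value in src.items(section):
            dst.set(section, key, value)
        with open(dst_ini, "w") as f:
            dst.write(f)

# Hypothetical usage (paths are placeholders, not actual install locations):
# copy_hadoop_connection("plugin/ssa.ini", "service/ssa.ini", "HDFS_Sales")
```

Copying the section by hand works just as well; the point is only that the entry must be byte-for-byte equivalent on every machine that hosts a File Archive Service component.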
If you have installed the File Archive Service agent on multiple machines, copy the Hadoop connection definition to the ssa.ini file on each machine that hosts a File Archive Service agent.

Step 6. Validate the Connection to HDFS

Verify the connection to HDFS from the File Archive Service and from the File Archive Service plug-in in Data Archive. Use the File Archive Service ssadrv administration command to validate the connection to the HDFS file archive store. On the machine that hosts the File Archive Service, run the following command:

ssadrv -a hdfs://<Connection Name>/<Path to the file archive folder in HDFS>

For example, run the following command:

ssadrv -a hdfs://HDFS_Sales/data/sandqa1/infa_archive

In this command, HDFS_Sales is the name of the Hadoop connection defined in the ssa.ini file and data/sandqa1/infa_archive is the path to the optimized file archive folder named infa_archive in the HDFS file archive store.

You can run the same command on the machine that hosts Data Archive to test the connection from Data Archive to HDFS.

Author

Marissa Johnston
Staff Technical Writer
Acknowledgements

Thanks to Vassiliy Truskov for his help in completing this article.