Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Similar documents

Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters

Hadoop MultiNode Cluster Setup

Configure Windows 2012/Windows 2012 R2 with SMB Direct using Emulex OneConnect OCe14000 Series Adapters

Integrating SAP BusinessObjects with Hadoop. Using a multi-node Hadoop Cluster

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

Set JAVA PATH in Linux Environment. Edit.bashrc and add below 2 lines $vi.bashrc export JAVA_HOME=/usr/lib/jvm/java-7-oracle/

Lenovo ThinkServer Solution For Apache Hadoop: Cloudera Installation Guide

How to install Apache Hadoop in Ubuntu (Multi node/cluster setup)

CDH 5 Quick Start Guide

How to install Apache Hadoop in Ubuntu (Multi node setup)

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster

VMware vsphere Big Data Extensions Administrator's and User's Guide

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

CDH installation & Application Test Report

HADOOP - MULTI NODE CLUSTER

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.

Deploy and Manage Hadoop with SUSE Manager. A Detailed Technical Guide. Guide. Technical Guide Management.

HSearch Installation

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

SAS Data Loader 2.1 for Hadoop

A Study of Data Management Technology for Handling Big Data

Terms of Reference Microsoft Exchange and Domain Controller/ AD implementation

Hadoop Basics with InfoSphere BigInsights

TP1: Getting Started with Hadoop

How To Install Hadoop From Apa Hadoop To (Hadoop)

Deploying Hadoop on SUSE Linux Enterprise Server

HADOOP. Installation and Deployment of a Single Node on a Linux System. Presented by: Liv Nguekap And Garrett Poppe

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g.

HADOOP CLUSTER SETUP GUIDE:

Single Node Hadoop Cluster Setup

Apache Hadoop new way for the company to store and analyze big data

Hadoop Lab - Setting a 3 node Cluster. Java -

Perforce Helix Threat Detection OVA Deployment Guide

Dell Reference Configuration for Hortonworks Data Platform

Hadoop (pseudo-distributed) installation and configuration

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Increasing Hadoop Performance with SanDisk Solid State Drives (SSDs)

研發專案原始程式碼安裝及操作手冊. Version 0.1

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Platfora Installation Guide

Big Data Operations Guide for Cloudera Manager v5.x Hadoop

XenClient Enterprise Synchronizer Installation Guide

How To Set Up A Firewall Enterprise, Multi Firewall Edition And Virtual Firewall

How To Run Apa Hadoop 1.0 On Vsphere Tmt On A Hyperconverged Network On A Virtualized Cluster On A Vspplace Tmter (Vmware) Vspheon Tm (

Plexxi Control Installation Guide Release 2.1.0

Trend Micro Incorporated reserves the right to make changes to this document and to the products described herein without notice.

StarWind iscsi SAN: Configuring Global Deduplication May 2012

Hadoop Training Hands On Exercise

Revolution R Enterprise 7 Hadoop Configuration Guide

A Performance Analysis of Distributed Indexing using Terrier

Quick Install Guide. Lumension Endpoint Management and Security Suite 7.1

CycleServer Grid Engine Support Install Guide. version 1.25

Hadoop Size does Hadoop Summit 2013

This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download.

installation administration and monitoring of beowulf clusters using open source tools

Introduction to VMware EVO: RAIL. White Paper

Prepared for: How to Become Cloud Backup Provider

Hadoop Installation. Sandeep Prasad

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Syncplicity On-Premise Storage Connector

Platfora Big Data Analytics

How to Configure Intel Ethernet Converged Network Adapter-Enabled Virtual Functions on VMware* ESXi* 5.1

CactoScale Guide User Guide. Athanasios Tsitsipas (UULM), Papazachos Zafeirios (QUB), Sakil Barbhuiya (QUB)

Local Caching Servers (LCS): User Manual

HADOOP MOCK TEST HADOOP MOCK TEST II

Evaluation Report: Emulex OCe GbE and OCe GbE Adapter Comparison with Intel X710 10GbE and XL710 40GbE Adapters

HP StorageWorks 8Gb Simple SAN Connection Kit quick start instructions

Platfora Installation Guide

NIST/ITL CSD Biometric Conformance Test Software on Apache Hadoop. September National Institute of Standards and Technology (NIST)

Hadoop Installation Guide

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Reboot the ExtraHop System and Test Hardware with the Rescue USB Flash Drive

HP Departmental Private Cloud Reference Architecture

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Centrify Identity and Access Management for Cloudera

Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box

Important Notice. (c) Cloudera, Inc. All rights reserved.

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Single Node Setup. Table of contents

Using Symantec NetBackup with Symantec Security Information Manager 4.5

PARALLELS SERVER 4 BARE METAL README

Virtual Appliance Installation Guide

Configuring Informatica Data Vault to Work with Cloudera Hadoop Cluster

Using The Hortonworks Virtual Sandbox

What the student will need:

Virtual Appliance Setup Guide

CLOUDERA REFERENCE ARCHITECTURE FOR VMWARE STORAGE VSPHERE WITH LOCALLY ATTACHED VERSION CDH 5.3

Using Red Hat Network Satellite Server to Manage Dell PowerEdge Servers

Tableau Spark SQL Setup Instructions

Hadoop Data Warehouse Manual

Hadoop Installation MapReduce Examples Jake Karnes

Hadoop Tutorial. General Instructions

Reference Design: Scalable Object Storage with Seagate Kinetic, Supermicro, and SwiftStack

IDS 561 Big data analytics Assignment 1

BIG-IP Virtual Edition Setup Guide for Linux KVM. Version 11.4

Apache Hadoop Cluster Configuration Guide

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications.

HDFS Users Guide. Table of contents

Transcription:

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Table of Contents Introduction... Hardware requirements... Recommended Hadoop cluster hardware specification... Installation and configuration of servers... 1. Verification of NIC profile/firmware/driver versions for all servers... 2. Assign IPv4 address to port 0 on every server... 3. Setup passwordless SSH on all servers... 4. Configure/etc/hosts file... 5. Install Java... 6. Disable Firewall... Installing and Configuring CDH Hadoop... Verification of Hadoop cluster... 1. Create a sample file... 2. Copy the file to HFDS... 3. Run the inbuilt *.jar application... Conclusion... References... 3 3 4 5 5 7 8 9 10 10 10 16 16 16 17 18 19 2 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Introduction The rapid growth of social media, cellular advances and requirements for data analytics has challenged the traditional methods of data storage and data processing for many large business and government entities. To solve the data storage and processing challenges, organizations are starting to deploy large clusters of Apache Hadoop a solution that helps manage the vast amount of what is commonly referred to as big data. The Emulex OneConnect family of OCe14000 10Gb Ethernet (10GbE) Network Adapters plays an important role in the Hadoop cluster, to move the data efficiently across the cluster. This lab guide describes the necessary hardware, software and configuration steps needed to deploy the Cloudera CDH 5.x (Cloudera Distribution Including Apache Hadoop) with the Emulex family of OCe14000 10GbE Network Adapters. A brief introduction to centralized management of OCe14000 Adapters using Emulex OneCommand Manager is also being reviewed. Intended audience: System and network architects and administrators Hardware requirements For this lab guide we implemented a five-node cluster, but this is scalable as required. Adding a new DataNode to a Hadoop cluster is a very simple process. However, NameNode s RAM and disk space must be taken into consideration before adding additional DataNodes. NameNode is the most important part of a Hadoop Distributed File System (HDFS). It keeps track of the directory tree of all the files in the file system, and tracks where in the cluster the file data is kept. It also is the single point of failure in a Hadoop cluster. With Hadoop 2.0, this issue has been addressed with the HDFS High Availability (HA) feature (refer to Apache s HDFS High Availability Guide ). It is always recommended to implement high availability in the cluster. A DataNode stores data in the HDFS. A cluster will always have more than one DataNode, with data replicated across all of them. The required hardware for implementing and testing Hadoop with OCe14000 10GbE Adapters is listed below: Hardware components Quantity Description Comments Server 4 or more Any server with Intel/AMD processors, which support Linux 1 NameNode (MasterNode); 3 or more DataNode (SlaveNode, JobHistoryNode, ResourceManager) Hard drives 2 or more per server Any SAS or SATA drive RAID controller 4 or more Anyserver, which supports Linux RAM Emulex OCe14000 Network Adapters 48GB+ per server 4 or more 10GbE network adapters The OCe14102-UM adapter was used in this configuration Switch 1 10Gbps switch Cables 4 or more 10Gbps optical SFP+ cables Figure 1. List of hardware required. 3 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Figure 2. Sample configuration. Recommended Hadoop cluster hardware specification There are a lot of factors affecting the choice of hardware for a new Hadoop cluster. Hadoop runs on industry-standard hardware. However, selecting the right mix of CPU, hard drives or RAM for any workload can help you make the Hadoop cluster more efficient. The recommendations below are formulated from Cloudera. Please consider your organization s workload before selecting the hardware components. Node Hardware Components Specifications NameNode CPU Hard Drive RAM CPU 2 quad/hex/oct core CPUs, running at least 2GHz 2 or more 1TB hard disks in a RAID configuration 48 128GB 2 quad/hex/oct core CPUs, running at least 2GHz DataNode Hard Drive 2 or more 1TB hard disks in a JBOD configuration RAM 64 512GB Figure 3. Recommended server specifications for a NameNode and DataNode. 4 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Software requirements Software Components Quantity Description Application Note Components OS 5 or more Any supported Linux or Windows OS CentOS 6.4 Java 5 or more Java 1.6.x or higher Java 1.7.0 Cloudera Manager 1 Centralized management for Cloudera s CDH Cloudera Manager 5.3.3 CDH 5 1 CDH 4.x or higher for implementing Hadoop 2.x CDH 5.0.6 OCe14102 Firmware 5 or more Download the latest firmware from the Emulex website. 10.2.370.19 OCe14102 Driver 5 or more Download the latest driver from the Emulex website. 10.2.363.0 Emulex OneCommand Manager 5 or more Download the latest version of OneCommand Manager from the Emulex website. 10.2.370.16 Figure 4. Software requirements. Installation and configuration of servers Install CentOS 6.4 on five servers. For this lab guide, five different names were assigned to the servers. Essential services like ResourceManager and JobHistory Server were split across the SlaveNode. The names along with the roles are listed below: 1. Elephant : NameNode Server (MasterNode) 2. Monkey : JobHistory Server, DataNode (SlaveNode) 3. Horse : ResourceManager, DataNode (SlaveNode) 4. Tiger : DataNode (SlaveNode) 5. Lion : DataNode (SlaveNode) 6. Workstation-centos: Centralized Cloudera Manager Server Connect the OCe14102 Adapter to the PCI Express (PCIe) slots. Upgrade the adapter with the latest firmware, driver and version of OneCommand Manager. Connect port 0 of every server to the top-of-rack (ToR) switch and ensure that the link is up. Note: There will be a system reboot required for upgrading that firmware. 1. Verification of NIC profile/firmware/driver versions for all servers. Verify the Network Interface Card (NIC) profile in the following steps: a. Start OneCommand Manager. b. Select the OCe14102-UM Adapter and go to the Adapter Configuration tab. c. Ensure that the personality is set to NIC. 5 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Figure 5. NIC personality verification. Verify the firmware version in the following steps: a. Select the OCe14102-UM Adapter and go to the Firmware tab. b. Ensure that the firmware version is same as the one in which you have upgraded. Figure 6. Firmware verification. 6 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Verify the the driver version in the following steps: a. Start OneCommand Manager. b. Select NIC from Port 0 or Port 1 of the OCe14102-UM Adapter and go to the Port Information tab. c. Ensure that the driver version is the same as the one in which you have upgraded. Figure 7. Driver verification. 2. Assign IPv4 address to port 0 on every server. a. Assign the following IP addresses to port 0 of the OCe14102 Adapter using Network Manager or ipconfig. 7 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

b. Verify the IP address assignment using OneCommand Manager. Figure 8. IP address verification. 3. Setup passwordless SSH on all servers. a. Generate authentication Secure Shell (SSH)-keygen keys on Elephant using ssh-keygen-t rsa. Figure 9. ssh-keygen output. Note: logged in user should have privileges to run the ssh-keygen command or run the sudo command before running the sshkeygen command. 8 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

b. Run ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostname for Monkey, Lion, Tiger and Horse. Figure 10. ssh-copy sample output. Note: ssh-copy-id command must be run for all the servers in the cluster to enable the login to SSH without a password (passwordless login). c. Verify that the password ssh is working. Figure 11. Verification of passwordless ssh. 4. Configure/etc/hosts file. The host names of the servers in the cluster along with the corresponding IP address need to be added to the /etc/hosts file. This is used for the operating system to map host names to IP addresses. a. The Port 0 IP assigned to the OCe14000 series adapter corresponding to each host should be added to the /etc/hosts file. Add the lines listed below to the /etc/hosts file on Elephant. 10.8.1.11 Elephant 10.8.1.15 Horse 10.8.1.12 Monkey 10.8.1.12 Tiger 10.8.1.14 Lion 10.8.1.45 workstation-cento6u4 b. Copy the /etc/hosts file from Elephant to Horse, Monkey, Tiger, Lion and workstation-cento6u4 using the scp command Figure 12. scp/etc/hosts. Note: /etc/hosts should be copied to all the servers in the Hadoop cluster. 9 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

5. Install Java Download and install Java on all of the hosts in the Hadoop cluster. 6. Disable Firewall For this Application Note, the firewall was disabled. Please consult your network administrator for allowing the necessary services by the firewall to make the Hadoop cluster work. Installing and Configuring CDH Hadoop Notes: n Please follow the Cloudera instructions on installing CDH Hadoop according to your environment s need. n For this guide we have implemented CDH 5.3 using the package method using Cloudera Manager. n All config file changes are made using Cloudera Manager and are deployed on the clients using Cloudera Manager. n Config files for Hadoop are present under /usr/lib/hadoop/etc/hadoop or /etc/hadoop. n Scripts are located in the Appendix. 1. Download Cloudera Manager on the centralized management server. Follow the instructions listed on the page to install the Cloudera Manager. 2. Open a browser and type the IP address and login. Select Cloudera Express from the start page. Figure 13. Cloudera Manager. 10 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

3. Follow the instructions to install CDH 5.3.3 Figure 14. CDH 5.3.3 install. 11 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

4. Install the services required for your environment. For this application note, under Custom Services, HDFS and YARN services were chosen. Assign roles to the host in the cluster. Figure 15. CDH 5.3.3 install. 12 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

5. Set up the slaves, *.xml config files and hadoop_env.sh according to the needs of your environment. For this lab guide, the configuration changes were made for HDFS and YARN under the configuration tab: Figure 16. Configuration changes. 6. Below are the configuration changes which were made: Name mapreduce.framework.name mapreduce.jobhistory.address mapreduce.jobhistory.webapp.address yarn.app.mapreduce.am.staging-dir Value yarn monkey:10020 monkey:19888 /user mapred.child.java.opts -Xmx512m dfs.namenode.name.dir dfs.datanode.data.dir yarn.resourcemanager.hostname yarn.application.classpath yarn.nodemanager.aux-services yarn.nodemanager.local-dirs yarn.nodemanager.log-dirs yarn.nodemanager.remote-app-log-dir n.log-aggregation-enable file:///disk1/dfs/nn,file:///disk2/dfs/nn file:///root/desktop/data1/dn horse Leave the value specified in the file mapreduce_shuffle file:///root/desktop/data1/nodemgr/local /var/log/hadoop-yarn/containers /var/log/hadoop-yarn/apps true Figure 17. Configuration parameters. 13 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

7. Deploy the configuration to the clients by going to the home page and under Cluster 1, click on Deploy Client Configuration. Figure 18. Deploy client configuration. 8. Start the Cloudera Management Service by clicking on Start under Cloudera Management Service Actions. Figure 19. Start Cloudera management service. 14 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

9. Start all the services by clicking on Start under Cluster1 on the home page. Figure 20. Start the services in the cluster. 10. The status and roles on the clients can be seen under the Hosts tab. Figure 21. Service status on clients. 15 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Verification of Hadoop cluster To verify the Hadoop cluster, a simple wordcount example can be run. 1. Create a sample file. Create a sample file with a few test sentences using the VI editor on Elephant: vi input_word_count. Figure 22. Sample file. 2. Copy the file to HFDS. For Elephant, create a directory named in in HDFS and copy the sample file input_word_count to HDFS under the in directory: hadoop fs -mkdir /in hadoop fs -put input_word_count /in hadoop fs -is /in The directory, file and the actual location of the file can be viewed using a web interface on the NameNode. The web address is elephant:50070. This will also show that the replication factor of each block is 3 and it is saved at three different DataNodes (Figures 21 and 22). Figure 23. Verification of HFDS directory listing using web interface. 16 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Figure 24. File size in HFDS. Figure 25. Replication factor and file location information on the HFDS. 3. Run the inbuilt *.jar application. Run the command listed below to start the wordcount application. /usr/lib/hadoop/bin/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /in/input_word_count /out The output will be displayed on the terminal. The result location can also be viewed using the web interface. 17 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Figure 26. Sample output. Figure 27. Sample web interface output in HFDS. Conclusion A complete overview of hardware and software components required for successfully deploying and evaluating Cloudera CDH Hadoop with Emulex OCe14000 Network Adapters was presented in this lab guide. More details about the Hadoop architecture and Cloudera can be obtained from the Apache Hadoop official website: http://hadoop.apache.org/ and http://www.cloudera.com/content/cloudera/en/home.html respectively. The solution described in this lab guide is scalable for a larger number of DataNodes and racks to suit the environment and needs of your organization. Emulex OCe14000 Network Adapters can be used in the cluster to move the data efficiently across the nodes. 18 LAB GUIDE Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

References Solution Implementer s Lab: http://www.implementerslab.com/ Cloudera : http://www.cloudera.com/content/cloudera/en/home.html Emulex Ethernet networking and storage connectivity products: http://www.emulex.com/products/ethernet-networking-storage-connectivity/ Apache Hadoop overview: http://hadoop.apache.org/ How to set up a multimode Hadoop cluster: http://wiki.apache.org/hadoop/#setting_up_a_hadoop_cluster Cloudera resources: http://www.cloudera.com/content/cloudera/en/resources.html Wikipedia: http://en.wikipedia.org/wiki/apache_hadoop DataNode: http://wiki.apache.org/hadoop/datanode NameNode: http://wiki.apache.org/hadoop/namenode For product information and a complete list of distributors, please visit our website at www.emulex.com Avago, Avago Technologies, Emulex, OneConnect, OneCommand and the Emulex logo are trademarks of Avago Technologies in the United States and other countries. All other brand and product names are the trademarks of their respective owners. Copyright 2015 Avago Technologies. All rights reserved. ELX15-2579 5/15