Intel Distribution for Apache Hadoop on Dell PowerEdge Servers


A Dell Technical White Paper

Armando Acosta, Hadoop Product Manager, Dell Revolutionary Cloud and Big Data Group
Kris Applegate, Solution Architect, Dell Solution Centers
Dave Jaffe, Ph.D., Solution Architect, Dell Solution Centers
Rob Wilbert, Solution Architect, Dell Solution Centers

Executive Summary

This document details the deployment of Intel Distribution for Apache Hadoop* software on the Dell PowerEdge R720XD. The intended audiences for this document are customers and system architects looking for information on implementing Apache Hadoop clusters within their information technology environments for Big Data analytics. The reference configuration introduces all the high-level components, hardware, and software included in the stack; each high-level component is then described individually. Dell developed this document to help streamline deployment, provide best practices, and improve the overall customer experience.

THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.

© 2013 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell. Dell, the DELL logo, and the DELL badge are trademarks of Dell Inc. Intel and Xeon are registered trademarks of Intel Corp. Red Hat is a registered trademark of Red Hat Inc. Linux is a registered trademark of Linus Torvalds. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell Inc. disclaims any proprietary interest in trademarks and trade names other than its own.

July 2013

Table of Contents

1 Introduction
2 Dell Solution Centers
3 Dell's Point of View on Big Data
4 Intel Distribution for Apache Hadoop
   Hadoop Use-Cases
   Intel's Contributions to Open Source
5 Intel Hadoop Solution Software Components
   Server Roles
6 Best Practices for Running Intel Distribution of Apache Hadoop on Dell
   Node Count Recommendations
   Hardware Recommendations
   Monitoring
   Resiliency
   Performance
   Software Considerations
   Installation Environment Assumptions
   High Availability
   Installation Considerations
7 Testing
   HiBench
   Teragen / Terasort
   Tested Configuration
   Tuning and Optimization of Workloads
8 Conclusions
9 Resources
   Links
   Additional Whitepapers

Tables
Table 1. Recommended Cluster Sizes
Table 2. Software Revisions
Table 3. PowerEdge R720 Infrastructure Node As-Tested Configuration
Table 4. PowerEdge R720XD Data Node As-Tested Configuration
Table 5. Key Hadoop Configuration Parameters

Figures
Figure 1. Dell Solution Centers Locations
Figure 2. Big Data Demands
Figure 3. Intel Foundational Technologies for Hadoop Performance
Figure 4. Dell Big Data Cluster Logical Diagram
Figure 5. Ganglia Performance Monitor Tool (Included with IDH)
Figure 6. Cluster Network Diagram
Figure 7. Dell's OpenManage Power Center
Figure 8. Dell R720XD Models with 2.5-inch and 3.5-inch Drives
Figure 9. The Role Assignment Dropdown for HDFS Roles
Figure 10. Mount Point Configuration for the dfs.data.dir Directories
Figure 11. Intel Active Tuning Technology

1 Introduction

Hadoop is an Apache open source project built and used by a global community of contributors, using the Java programming language. Hadoop's architecture is based on the ability to scale in a nearly linear fashion. By harnessing the power of this tool, many customers who previously had difficulty sorting through their complex data can now deliver value faster, gain deeper insight, and even develop new business models based on the speed and flexibility these analytics provide.

However, installing, configuring, and running Hadoop is not trivial. Different roles and configurations need to be deployed across various host computers, and designing, deploying, and optimizing the network layer to match Hadoop's scalability requires consideration of the type of workloads that will run on the cluster. These issues are complicated by both the fast-moving pace of the core Hadoop project and the challenges of managing a system designed to scale to thousands of nodes.

Dell's customer-centered approach is to create rapidly deployable and highly optimized end-to-end Hadoop solutions running on highly scalable hardware. Dell listened to its customers and partnered with Intel to design a Hadoop solution that is unique in the marketplace, combining optimized hardware, software, and services to streamline deployment and improve the customer experience. Intel has created a high-quality, controlled distribution of Hadoop and offers commercial management software, updates, support, and consulting services.

The Intel Distribution for Apache Hadoop (IDH) software includes:

- The Intel Manager for Apache Hadoop software to install, configure, monitor, and administer the Apache Hadoop cluster
- Enhancements to HBase and Hive for improved query performance and end-user experience
- Resource monitoring capability using Nagios and Ganglia in the Intel Manager
- Superior security and performance through tightly integrated encryption and compression, authentication, and access control
- A packaged Apache Hadoop ecosystem that includes HBase, Hive, and Apache Pig, among other tools

This solution provides a foundational platform for Intel to offer additional solutions as the Apache Hadoop ecosystem evolves and expands. Aside from the Apache Hadoop core technology (HDFS, MapReduce, etc.), Intel has designed additional capabilities to address specific customer needs for Big Data applications, such as:

- Optimal installation and configuration of the Apache Hadoop cluster
- Monitoring, reporting, and alerting of the hardware and software components
- Job-level metrics for analyzing specific workloads deployed in the cluster
- Infrastructure configuration automation

In recent tests in the Dell Solution Centers, Intel Distribution for Apache Hadoop Release 2.4.1 was installed and tested on a cluster of Dell PowerEdge R720 servers, resulting in a set of best practices for installing IDH on Dell clusters. The next sections describe the role of the Dell Solution Centers and Dell's point of view on Big Data, followed by details of the IDH solution and the IDH software components. Finally, the best practices developed by the Solution Centers and the results of the IDH-on-Dell tests are described.

2 Dell Solution Centers

The Dell Solution Centers (DSC) are a global network of connected labs that allow Dell to help customers architect, validate, and build solutions across Dell's entire enterprise portfolio. The Dell Intel Cloud Acceleration Program (DICAP), a team within the Dell Solution Centers, has the mission of providing customer engagements on the topics of Cloud and Big Data. With centers in every region, the DSC engages customers through informal 30-60 minute briefings, longer half-day architectural design sessions, and one- to two-week proof-of-concept tests that enable customers to kick the tires of Dell solutions prior to purchase. Interested customers should engage with their Dell account team to access the services of the DSC.

Figure 1. Dell Solution Centers Locations (São Paulo and Dubai coming in the second half of 2013)

3 Dell's Point of View on Big Data

Big Data is a term often hyped in the IT press, with many different interpretations of what exactly it means. In Dell's point of view, the methods and principles of Big Data aren't new to the computer industry: Dell has been providing such solutions for years in High Performance Clustered Computing (HPCC), data warehouses, and traditional databases. What has changed is the scale at which these tools need to operate. Every new device in use in today's society gathers more and more data, and the need to store, report on, and analyze it is paramount. The term "big" can apply on a variety of different scales (see Figure 2):

- Volume: no longer in the realm of gigabytes, but rather terabytes or petabytes.
- Velocity: devices can now generate more data in a short time than can be ingested using traditional means.
- Variety: with the data types and schemas of the various datasets differing so much, the ability to use a common datastore and to query across them provides tremendous value.

Figure 2. Big Data Demands

4 Intel Distribution for Apache Hadoop

Dell continues to hear from customers about their Big Data challenges, specifically the need for solutions that allow flexibility and choice while enabling key insights from their data. Based on customer conversations and Dell's experience in providing Hadoop solutions, one size does not fit all: each Hadoop distribution offers unique features and benefits. For this very reason, Dell is introducing the partnership with Intel for the Intel Distribution for Apache Hadoop* software on the PowerEdge R720XD.

The Dell and Intel partnership is good for all customers that want value from their data. Both companies share a common goal to help build a robust Apache Hadoop ecosystem that is enterprise-ready, allowing all customers to take advantage of this disruptive technology. The partnership provides stability to the Apache Hadoop open source project; both companies have long-term strategies that will help drive the right capabilities and features, bringing the most value to customers.

Intel brings a unique value proposition for customers: the ability to deliver an optimized solution from the CPU silicon all the way to the Hadoop distribution. Intel is the only vendor that can marry CPU technologies, SSD technology, and 10Gb Ethernet to benefit Hadoop performance. The Intel Distribution for Apache Hadoop software focuses on performance and security. The Dell and Intel strategy is to reinforce the Hadoop distribution by making it more enterprise-ready and to provide a viable platform for Big Data workloads in all IT environments. The Intel Distribution for Apache Hadoop software is especially suited for use cases where security, performance, and ease of data management are key needs.

Figure 3. Intel Foundational Technologies for Hadoop Performance

Hadoop Use-Cases

The Intel Distribution for Apache Hadoop has been deployed in many different customer scenarios. A few use cases that stand out are in healthcare, telecommunications, and smart-grid technology:

- Healthcare: Customers use the massive database capabilities of IDH to store and process the human genome, evaluate pharmaceutical results, and make patient-care decisions. In genomic research, the fact that each human genome consists of 3.2 billion base pairs with upwards of 4 million variants drives the need for a cost-effective, high-performance, scalable data processing engine. At the same time, the deep security enhancements IDH provides are of major importance given the healthcare industry's strict compliance regulations.

- Telecommunications: More and more mobile devices are getting into the hands of people all over the world. The billing systems for mobile providers need to track calls and their durations, text messages, and data usage; more importantly, they need to report on this in near real time. Hadoop is used instead of traditional massively parallel processing (MPP) and data warehouse (DW) technologies due to its lower total cost of ownership (TCO) and inherent fault tolerance.

- Energy Smart-Grid: Mobile devices aren't the only things generating new data streams. Smart power meters generate large streams of sensor data that energy and utility companies can use to optimize service delivery. The ability to store this data efficiently allows these companies to increase the rate of collection and provide additional, more granular detail. Traditional databases are proving incapable of handling the ingestion rate of this data at an affordable cost.

Intel's Contributions to Open Source

As with many other open source projects, Hadoop's power owes itself to the community that developed it. Contribution to open source projects, either directly or by enhancing the ecosystem, drives further adoption and deepens utilization. Intel has a long history of both contributing to core open source projects (the Linux kernel, Hadoop, and KVM) and creating complementary projects. Two key programs to note in the context of Hadoop are:

- Project Rhino: This Intel-driven project enhances the data protection capabilities of Hadoop to address the security and compliance challenges around emerging use-cases. More details can be found at https://github.com/intel-hadoop/project-rhino/

- Project Panthera: This project's goal is to provide full SQL support to help companies integrate Hadoop more deeply with their existing data analytics processes. More details can be found at https://github.com/intel-hadoop/project-panthera

5 Intel Hadoop Solution Software Components

Hadoop Distributed File System (HDFS): This is the clustered file system at the core of the Hadoop software stack. When data is stored on this file system, it is automatically distributed for both resiliency and redundancy; in the default configuration, every file is stored three times on three different nodes. With Intel Hadoop, tunable parameters can be set to increase or decrease the file replication level as the file access frequency increases or decreases.

MapReduce: This is the distributed, batch-oriented, parallel processing framework that enables data analysis at large scale. The framework is accessed by writing Java-based MapReduce jobs that are executed against datasets in HDFS (a minimal example appears at the end of this section).

Hive: Hive makes accessing the power of MapReduce more familiar to existing database customers. It exposes the data that resides on HDFS as a SQL-like database; standard SQL queries run against this data are translated into MapReduce by Hive and executed behind the scenes. With Intel Hadoop, Hive queries can run faster on datasets in HBase.

HBase: Some use-cases demand faster response times than a batch-based job through Hive or MapReduce can deliver. For these use cases, HBase provides a non-relational, column-oriented, distributed database that resides directly on top of HDFS, allowing users to leverage HDFS's massive scalability to serve emerging non-traditional databases. The HBase distribution in IDH is tuned to perform ad hoc queries over large datasets faster via Hive.

Server Roles

Name Node/JobTracker(s): These nodes serve as control nodes for the HDFS, MapReduce, and HBase processes. For HDFS, they own the block map and directory tree for all the data on the cluster. For MapReduce, they run the JobTracker daemon that handles job execution and monitoring. Lastly, for HBase, these servers are responsible for running the monitoring processes as well as owning any metadata operations. Production environments should have a primary and at least one standby Name Node.

Data Node(s): These nodes hold the data and execute the MapReduce jobs. They are generally filled with large numbers of local disks, enabling the parallel processing and distributed storage features of Hadoop. The number of Data Nodes is dictated by the use case; adding Data Nodes increases both performance and capacity simultaneously.

Edge Node(s): These servers lie on the perimeter of the dedicated Hadoop network and are where external users and business processes interact with the cluster. Often they have a number of Network Interface Cards (NICs) attached to the Hadoop network as well as separate NICs attached to the enterprise's production IT network. More Edge Nodes can be added as external access requirements increase.

Intel Manager Node: This node is where the Intel Manager software resides. It runs the configuration management processes, web server software, and performance monitoring software. In production installations, a dedicated server should fulfill this task; in smaller installations, such as the one employed by Dell in these tests, this role can be shared with the Edge Node.
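To make the programming model concrete, the sketch below is a minimal word-count job written against the Hadoop 1.x Java API (the code line IDH 2.4 is based on). This is an illustrative example, not code from the IDH distribution; the input and output paths are user-supplied HDFS directories.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: runs in parallel on the Data Nodes, emitting (word, 1) pairs.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          StringTokenizer it = new StringTokenizer(value.toString());
          while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);
          }
        }
      }

      // Reduce phase: sums the counts for each word after the shuffle/sort.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : vals) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count"); // Hadoop 1.x constructor
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation on map output
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

A job like this is packaged as a jar and submitted from an Edge Node, for example with "hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output" (the paths here are hypothetical).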

Figure 4. Dell Big Data Cluster Logical Diagram

Figure 5. Ganglia Performance Monitor Tool (Included with IDH)

6 Best Practices for Running Intel Distribution of Apache Hadoop on Dell

Node Count Recommendations

Dell recognizes that use-cases for Hadoop range from small development clusters all the way through large, multi-petabyte production installations. Dell has a Professional Services team that sizes Hadoop clusters for a customer's particular use. As a starting point, three cluster configurations can be defined for typical use:

- Minimum Development Cluster: Targeted at functional testing, this may even be built from existing equipment. However, the performance of these clusters can be significantly lower, as they do not benefit from the highly distributed nature of HDFS.

- Recommended Small Cluster: A good starting point for customers taking their initial steps into running IDH in production. It provides some of the layers of resiliency that are expected in today's production IT world.

- Recommended Production Cluster: This configuration provides all the available options for resiliency at both the hardware and software layers. In addition, it allows for enough data nodes to demonstrate the performance benefits of distributed storage and parallel computing.

Table 1. Recommended Cluster Sizes

                      Minimum Development   Recommended Small   Recommended Production
                      Cluster               Cluster             Cluster
Name Node(s)          1 (1)                 2 (2)               2 (2)
Edge Node(s)          0 (1)                 1                   1
Data Node(s)          3                     5                   15
Intel Manager Node    0 (1)                 1                   1
1 GbE Switches        1                     1                   2
10 GbE Switches       0                     2                   2
Rack Units            9U                    20U                 42U

(1) In this case a single node serves as the Name, Job Tracker, Edge, and Intel Manager Node.
(2) In some cases a single server can serve as both the Name Node and Job Tracker.

Figure 6. Cluster Network Diagram

Hardware Recommendations

Dell's complete portfolio really shines when building comprehensive solutions. From the servers to the switches, and even down to the racks and monitoring tools, the value of deploying on Dell is readily apparent.

Monitoring

Using the Dell Remote Access Controllers (DRACs) in the servers, Dell customers can identify increases in power consumption and temperature as workloads exercise the disks and CPUs. One great tool to aid with this is Dell's OpenManage Power Center. This tool uses the Intel Node Manager technology, accessed through the Dell Remote Access Controller (DRAC), to provide metrics and trigger alert events based on customer criteria.

Figure 7. Dell's OpenManage Power Center

Resiliency

In production clusters it is imperative to keep an eye toward mitigating as many points of failure as possible. However, keep in mind that Hadoop (through both HDFS and MapReduce) is designed to be natively tolerant of failures and will take care of much of the needed underlying work. That said, when investing in a robust and resilient configuration, here are the key areas to focus on:

- Switches: Multiple stacked Force10 switches should be used for high availability. Force10 S60 1GbE switches utilize stacking modules, which provide easier switch management and faster inter-switch communication. On the Force10 S4810 there is the option of either stacking via the 10 GbE or 40 GbE ports (firmware 8.3.12+) or implementing Virtual Link Trunking if you plan to scale beyond the stacking limitations (see the switch documentation for configuration maximums).

- NICs: Either two single-port NICs or two dual-port NICs are recommended in the administration servers to guard against PCI-E slot failures. This is not as crucial on data nodes, due to data node redundancy.

- Disks: RAID is recommended only in the administration servers, such as the Name Node. In the Data Nodes it is strongly recommended to present as many separate disks as possible (no RAID). The flexibility of the PowerEdge R720XD really shines here, since it can hold either twelve 3.5-inch drives or twenty-four 2.5-inch drives.

Figure 8. Dell R720XD models with 2.5-inch and 3.5-inch drives

Performance

Performance optimization varies greatly from customer to customer, but a few principles should be considered in order to optimize cluster performance:

- Network: While 10 GbE isn't required, multiple bonded NICs of the fastest speed possible are strongly recommended for the data network. Workloads vary in whether they can truly benefit from a fast network, but with the prevalence of 10 GbE it is wise to invest ahead of the curve. You will also want enterprise-grade switches with deep per-port packet buffers in order to handle the volume and density of traffic Hadoop can generate. At 1 GbE the Dell Force10 S60 switches work well, and at 10 GbE the Dell Force10 S4810 is optimal.

- Disks: A key principle of performance tuning is to eliminate input/output (IO) starvation at the CPU layer and contention at the disk level. From this comes the initial recommendation of a 1:1 ratio of disk spindles to physical processor cores (with a hyperthread counting as half of one physical core for this purpose). The correct choice of disks and processors depends entirely on the workload, which can vary from heavily storage-centric, with massive disks and few processors, to heavily processor-centric, with many cores and PCI-E SSDs. The Dell Professional Services team can provide consultation and assessment to help customers achieve the proper balance. The Dell PowerEdge R720XD provides excellent flexibility with regard to drive and socket configurations.

- Memory: Few Hadoop use-cases will be memory-constrained, but administration servers should have sufficient memory for index caching (128GB for a robust configuration). For the data nodes, while there are emerging use-cases that call for large amounts of memory, Hadoop customer engagements in the Dell Solution Centers have shown that 64GB is a good initial target.

- CPUs: As mentioned above, the use-case determines the correct balance of CPU, memory, and disk speed. In performance-oriented use-cases, the most cores (balancing out spindle count if not using SSDs) and the highest possible frequency CPUs are recommended. If storage capacity is the priority, consider the more energy-efficient Intel Xeon E5-2600L series processors.

Software Considerations

Installation Environment Assumptions

- Updated Operating System: The selected OS should have appropriate updates applied prior to IDH installation. The IDH documentation lists supported OS versions as well as required updates.

- Package Management: The installation needs to reference an existing OS package repository, and a new repository for the IDH software needs to be created. In some cases (e.g., Red Hat Enterprise Linux) this may mean registering the OS with the proper credentials.

- DNS: Forward and reverse name resolution are required for installation. Host-to-host communication is handled by hostname, so this is imperative. It can be accomplished via /etc/hosts or a DNS server.

- NIC Bonding: To get as much bandwidth and resiliency as possible, Dell recommends implementing bonding on the NICs. In these tests, mode 6 (balance-alb) was used (an illustrative configuration sketch appears at the end of this subsection).

- Production Network Connectivity: The Edge Node needs to be connected to the user's existing network in order to facilitate access to the cluster. The speed of this link should meet the needs of the inbound data ingestion plans (both in number of users/processes and in volume of data).

High Availability

Production Hadoop workloads require a high degree of resiliency to achieve desired uptime goals. In IDH 2.4.1, High Availability (HA) is handled in an Active/Passive manner using a number of components:

- Distributed Replicated Block Device (DRBD): allows a logical device to be mirrored between two disparate systems.
- Pacemaker: a Cluster Resource Management (CRM) framework that starts, stops, monitors, and migrates resources automatically.
- Corosync: a messaging framework, used by Pacemaker, for internode communication.

Used together, these tools provide layers of redundancy for both the HDFS NameNode service and the MapReduce JobTracker. In order to enable HA, additional hardware may be required in the Name Nodes, including extra NICs, more memory, and additional disks. While failover of both the NameNode and JobTracker HA services is completely automatic, in-flight jobs must be resubmitted once the failover completes.

High availability also requires some additional network configuration. Virtual hostnames and IP addresses for both the NameNode and the JobTracker HA functions must be identified and recorded in all /etc/hosts files or DNS tables. It is worth noting that the IDH 2.4 release is based on the Hadoop 1.x open source line, which has no inherent HA option; Intel's distribution adds this capability.
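As a concrete illustration of the NIC bonding and virtual-hostname conventions above, the sketch below shows what the relevant pieces might look like on Red Hat Enterprise Linux 6. All interface names, addresses, and hostnames are hypothetical placeholders rather than values from the tested cluster; consult the IDH installation guide for the authoritative procedure.

    # /etc/sysconfig/network-scripts/ifcfg-bond0 -- bonded data-network interface
    DEVICE=bond0
    TYPE=Bond
    BOOTPROTO=none
    ONBOOT=yes
    IPADDR=192.168.100.11        # hypothetical data-network address
    NETMASK=255.255.255.0
    BONDING_OPTS="mode=balance-alb miimon=100"   # mode 6, as used in these tests

    # /etc/sysconfig/network-scripts/ifcfg-em1 -- one of the enslaved NICs
    DEVICE=em1
    BOOTPROTO=none
    ONBOOT=yes
    MASTER=bond0
    SLAVE=yes

    # /etc/hosts -- physical nodes plus virtual HA hostnames (all hypothetical)
    192.168.100.11  namenode1.hadoop.local       namenode1
    192.168.100.12  namenode2.hadoop.local       namenode2
    192.168.100.21  namenode-vip.hadoop.local    namenode-vip     # virtual NameNode name
    192.168.100.22  jobtracker-vip.hadoop.local  jobtracker-vip   # virtual JobTracker name

Whatever naming scheme is chosen, the same entries must be replicated identically on every node (or served from DNS), since Hadoop daemons address one another strictly by hostname.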

Installation Considerations

Role Assignments: During the installation, the setup wizard prompts for specific role assignments for the cluster servers. It is a good idea to use the Edit Roles button on the last page of the wizard to double-check that each of the parameters was set correctly, as shown in Figure 9.

Figure 9. The Role Assignment dropdown for HDFS roles

Mount Points: Mount points are key to properly configuring an optimized cluster. It is best practice to follow the installation guide and, prior to starting HDFS or any of the services, to make sure that the values set for dfs.data.dir (Figure 10) and mapred.data.dir point to the appropriate mount points. In the configuration shown below, there is one mount point allocated per physical spindle.

Figure 10. Mount point configuration for the dfs.data.dir directories
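Although IDH sets these values through the Intel Manager interface shown above, the underlying Hadoop 1.x property is simply a comma-separated list of directories, one per spindle. The fragment below is a hand-written sketch of the equivalent hdfs-site.xml entry; the /data01 through /data04 mount points are hypothetical placeholders, and a 24-drive R720XD data node would list 24 such entries.

    <property>
      <name>dfs.data.dir</name>
      <!-- One directory per physical spindle; HDFS rotates new blocks across them. -->
      <value>/data01/dfs/dn,/data02/dfs/dn,/data03/dfs/dn,/data04/dfs/dn</value>
    </property>

Spreading the directories across independent disks (rather than a single RAID volume) is what lets the Data Node drive all spindles in parallel, per the Resiliency recommendations earlier.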

7 Testing

HiBench

HiBench is a Hadoop benchmark framework consisting of nine typical workloads, spanning micro-benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks, and data analytics benchmarks. For this paper, the best-known subset of the HiBench suite, the Teragen/Terasort benchmark, was employed to test system IO.

Teragen / Terasort

These two HDFS/MapReduce benchmarks are used in conjunction with each other to stress Hadoop systems and provide valuable metrics on network, disk, and CPU utilization. By starting with these as a baseline, Hadoop administrators can tune Hadoop's wide variety of parameters to get the desired performance. Teragen starts by generating flat text files that contain pseudo-random data, which Terasort then sorts. This type of sort/shuffle exercise is similar to what customers do over and over as they manipulate data through MapReduce jobs. (Illustrative invocations are shown at the end of this section.)

Tested Configuration

In these tests a small Hadoop cluster was employed, as recommended in Table 1. The specific software revisions used in the test are shown in Table 2. The PowerEdge R720 and R720XD hardware configurations are shown in Table 3 and Table 4. The hardware listed should be used as initial guidance only; other configurations are entirely possible, and will likely be required, since each customer's environment and use-case is unique.

Table 2. Software Revisions

Component                               Revision
Red Hat Enterprise Linux                6.4
Intel Distribution for Apache Hadoop    2.4.1 (Build 16962)
Apache Hadoop (IDH is based on)         1.0.3
HBase                                   0.94.1
Hive                                    0.9.0
ZooKeeper                               3.4.5
HiBench                                 2.2

Table 3. PowerEdge R720 Infrastructure Node As-Tested Configuration

Component          Detail
Height             2 Rack Units (3.5")
Processor          2x Intel Xeon E5-2650 2 GHz 8-core processors
Memory             128 GB
Disk               6x 600 GB 15K SAS drives
Network            4x 1GbE LOMs, 2x 10GbE NICs
RAID Controller    PowerEdge RAID Controller H710 (PERC)
Management Card    Integrated Dell Remote Access Controller (iDRAC)

Table 4. PowerEdge R720XD Data Node As-Tested Configuration

Component          Detail
Height             2 Rack Units (3.5")
Processor          2x Intel Xeon E5-2667 2.9 GHz 6-core processors
Memory             64 GB
Disk               24x 500 GB 7200 RPM Nearline SAS drives
Network            4x 1GbE LOMs, 2x 10GbE NICs
RAID Controller    PowerEdge RAID Controller H710 (PERC)
Management Card    Integrated Dell Remote Access Controller (iDRAC)

Tuning and Optimization of Workloads

The cluster configuration variables used in these tests (Table 5) are simply a starting point. Parameters like dfs.block.size are highly contingent on the type of data being stored and its use-case. A Dell Professional Services engagement is recommended to achieve configurations optimized for the user's workload.

Table 5. Key Hadoop Configuration Parameters

Name                                        Value
dfs.block.size                              134217728
ipc.server.tcpnodelay                       FALSE
ipc.client.tcpnodelay                       FALSE
io.sort.factor                              100
io.sort.mb                                  400
io.sort.spill.percent                       0.8
io.sort.record.percent                      0.05
mapred.child.java.opts                      1024m
mapreduce.tasktracker.outofband.heartbeat   TRUE
mapred.job.reuse.jvm.num.tasks              1
mapred.min.split.size                       134217728
mapred.reduce.parallel.copies               20
mapred.reduce.tasks.speculative.execution   TRUE
mapred.reduce.tasks                         30 * # of Task Trackers
mapred.map.tasks                            20 * # of Task Trackers
mapred.compress.map.output                  TRUE
tasktracker.http.threads                    60
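For reference, a Teragen/Terasort run of the style used by HiBench can be launched directly with the standard Hadoop examples jar. The invocations below are a hand-written sketch against stock Apache Hadoop 1.0.3 (the release IDH 2.4.1 is based on); the jar path, HDFS paths, and row count are illustrative, and IDH users would typically drive the same benchmarks through the HiBench scripts instead.

    # Generate 10 billion 100-byte rows (about 1 TB) of pseudo-random input in HDFS
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar teragen 10000000000 /benchmarks/terasort-input

    # Sort the generated data; this phase stresses disk, network, and CPU together
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar terasort /benchmarks/terasort-input /benchmarks/terasort-output

    # Verify that the output is globally sorted
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar teravalidate /benchmarks/terasort-output /benchmarks/terasort-validate

Comparing elapsed times for the terasort step before and after adjusting the Table 5 parameters is a simple way to measure the effect of each tuning change on a given cluster.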