Dell Reference Configuration for Hortonworks Data Platform


A Quick Reference Configuration Guide

Armando Acosta, Hadoop Product Manager, Dell Revolutionary Cloud and Big Data Group
Kris Applegate, Solution Architect, Dell Solution Centers
Rob Wilbert, Solution Architect, Dell Solution Centers

Executive Summary

This document details the configuration setup for Hortonworks Data Platform (HDP) software on the PowerEdge R720XD. The intended audience is customers and system architects looking for information on configuring Apache Hadoop clusters within their information technology environment for big data analytics. The reference configuration introduces the server setup that can run the Hortonworks stack. The document focuses only on configuration; it does not go into detail about Hadoop solution components or resiliency, performance, or software considerations, and it does not cover best practices or a complete architecture for a Hortonworks Data Platform solution. Dell developed this document to help streamline configuration for the Hortonworks Data Platform software.

THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.

© 2013 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell. Dell, the DELL logo, and the DELL badge are trademarks of Dell Inc. Intel and Xeon are registered trademarks of Intel Corp. Red Hat is a registered trademark of Red Hat Inc. Linux is a registered trademark of Linus Torvalds. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell Inc. disclaims any proprietary interest in trademarks and trade names other than its own.

Reference Configuration

Hortonworks Data Platform is available on both Linux and, in partnership with Microsoft, on Windows. This initial configuration targets deployment on bare-metal servers running Red Hat Enterprise Linux 6.x.

Server Roles

Name Node(s) [1]

Name nodes serve as control nodes for the HDFS, MapReduce, and HBase processes. For HDFS, name nodes own the block map and directory tree for all the data on the cluster. With MapReduce, the name node owns the job tracking daemon (JobTracker) that handles job execution and monitoring. Lastly, with HBase, name nodes are responsible for running the monitoring processes as well as owning any metadata operations. In addition to a primary name node, a secondary name node is strongly recommended for any deployment beyond a proof-of-concept.

Data Node(s)

Data nodes hold the data and execute MapReduce jobs. They are generally filled with large amounts of local disk, enabling the parallel processing and distributed storage features of Hadoop. The number of data nodes is dictated by use case. Adding data nodes increases both performance and capacity. Maintaining a 1:1 ratio of CPU cores to disk spindles can be important in many high I/O workloads.

Edge Node(s)

Edge nodes lie on the perimeter of the dedicated Hadoop network and bridge the Hadoop environment with the production IT environment. Edge nodes enable external users and business processes to interact with the cluster. Additional edge nodes may be added to the Hadoop cluster as external access requirements increase.

Ambari Manager Node

The Ambari management node is where the Ambari server resides. It runs the configuration management processes, web server software, monitoring software (the open-source project Nagios), and performance monitoring software (the open-source project Ganglia). In a production environment, the Ambari server should run on a dedicated node; however, for the purposes of this document, the Ambari server was installed on the edge node.

[1] In Hortonworks terminology, the Name Node can be referred to as the Master Node.

Figure 1. Dell Big Data Cluster Logical Diagram

Node Count Recommendations

Dell recognizes that use cases for Hadoop range from small development clusters all the way through large multi-petabyte production installations. Dell has a Professional Services team that sizes Hadoop clusters for a customer's particular use. As a starting point, three cluster configurations can be defined for typical use:

Minimum Development Cluster

The minimum development cluster is targeted at functional testing and may even be built from existing equipment. Its performance can be significantly lower, however, because development clusters typically do not benefit from the highly distributed nature of HDFS.

Recommended Small Cluster

The recommended small cluster is a good starting point for customers taking the initial steps toward running HDP in production. A small cluster provides the basic layers of resiliency expected in today's production IT world.

Recommended Production Cluster

The recommended production cluster configuration provides dense storage and compute capacity, coupled with a high degree of resiliency. The production cluster allows for an adequate number of data nodes to demonstrate the performance benefits of distributed storage and parallel computing.

Table 1. Recommended Cluster Sizes

                          Minimum          Recommended      Recommended
                          Development      Small            Production
                          Cluster [3]      Cluster [3]      Cluster
Name Node(s)              1 [1]            1 [2]            2
Job Tracker(s)            0 [1]            0 [2]            1
Edge Node(s)              0 [1]            1 [2]            1
Data Node(s)              3                6                14
Ambari Management Node    0 [1]            0 [2]            1
1 GbE Switches            1                1                2
10 GbE Switches           0                2                2
Rack Units                9U               19U              42U

[1] In this case a single node serves as the name node, job tracker, edge node, and Ambari management node.
[2] In this case the Ambari management node, job tracker, and edge node roles are combined.
[3] Configurations include high availability and resiliency, which are recommended for production clusters; proofs of concept and small clusters can exclude high availability and resiliency.

Figure 2. Reference Configuration Diagram
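The rack-unit totals in Table 1 can be cross-checked with a little arithmetic. The sketch below is illustrative only: it takes the 2U server height from Tables 3 and 4 and assumes each switch occupies 1U, which the document does not state but which reproduces the published totals. The dictionary layout and function name are hypothetical.

```python
# Node and switch counts transcribed from Table 1.
CLUSTERS = {
    "minimum development": {
        "servers": {"name": 1, "job_tracker": 0, "edge": 0, "data": 3, "ambari": 0},
        "switches": {"1GbE": 1, "10GbE": 0},
    },
    "recommended small": {
        "servers": {"name": 1, "job_tracker": 0, "edge": 1, "data": 6, "ambari": 0},
        "switches": {"1GbE": 1, "10GbE": 2},
    },
    "recommended production": {
        "servers": {"name": 2, "job_tracker": 1, "edge": 1, "data": 14, "ambari": 1},
        "switches": {"1GbE": 2, "10GbE": 2},
    },
}

SERVER_HEIGHT_U = 2  # PowerEdge R720 / R720XD height per Tables 3 and 4
SWITCH_HEIGHT_U = 1  # assumption: not stated in the document


def rack_units(cluster):
    """Return the rack space in U consumed by the servers and switches of one cluster."""
    servers = sum(cluster["servers"].values())
    switches = sum(cluster["switches"].values())
    return servers * SERVER_HEIGHT_U + switches * SWITCH_HEIGHT_U


if __name__ == "__main__":
    for name, cluster in CLUSTERS.items():
        print(f"{name}: {rack_units(cluster)}U")
```

Under these assumptions the three configurations come out at 9U, 19U, and 42U, matching the Rack Units row of Table 1.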

Figure 3. Ambari Manager - Node Installation

Tested Configuration

For the purposes of this document, a small Hadoop cluster was deployed as recommended in Table 1. The specific software revisions used in the test are shown in Table 2. The PowerEdge R720 and R720XD hardware configurations we tested are shown in Table 3 and Table 4. The hardware listed should be used as initial guidance only. Additional configurations are possible and will likely be required, as each customer's environment and use case is unique. Common parameters that could differ include:

1. Processors: Higher frequencies and core counts may improve performance, while lower voltage/TDP processors, such as the Intel Xeon E5-2630L processor, can improve power efficiency.
2. Local Storage: Disk capacity, drive technology, and spindle speed can be matched to budget and performance requirements as necessary.
3. Memory: Depending on the usage of various services (HBase versus MapReduce), more or less memory may be necessary on both the infrastructure and data nodes.

Teragen / Terasort

These two HDFS / MapReduce benchmarks are used in conjunction with each other to stress Hadoop systems and provide valuable metrics on network, disk, and CPU utilization. By starting with these benchmarks as a baseline, Hadoop administrators can tune Hadoop's wide variety of parameters to achieve the desired performance. Teragen starts by generating flat text files that contain pseudo-random data, which Terasort then sorts. This type of sort / shuffle exercise simulates customer workloads as they manipulate data through MapReduce jobs.
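As a rough sketch of how such a benchmark pair might be launched, the helper below builds the two command lines for a target data size. It relies on the standard `teragen <rows> <output dir>` and `terasort <input dir> <output dir>` programs in the Hadoop examples jar and on TeraGen's fixed 100-byte rows. The jar file name and HDFS directories are placeholders, not values from the tested cluster, and the function names are hypothetical.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Number of rows TeraGen must write to produce roughly target_bytes of data."""
    return target_bytes // ROW_BYTES


def benchmark_commands(target_bytes, examples_jar, gen_dir, sort_dir):
    """Build the TeraGen and TeraSort command lines as argument lists."""
    rows = teragen_rows(target_bytes)
    teragen = ["hadoop", "jar", examples_jar, "teragen", str(rows), gen_dir]
    terasort = ["hadoop", "jar", examples_jar, "terasort", gen_dir, sort_dir]
    return teragen, terasort


if __name__ == "__main__":
    # Placeholder jar name and HDFS paths; adjust to the local installation.
    gen_cmd, sort_cmd = benchmark_commands(
        target_bytes=10**12,  # about 1 TB of input data
        examples_jar="hadoop-examples.jar",
        gen_dir="/benchmarks/teragen",
        sort_dir="/benchmarks/terasort",
    )
    print(" ".join(gen_cmd))
    print(" ".join(sort_cmd))
```

Running TeraGen first and then TeraSort against the generated directory, while watching the Ambari monitoring views, gives a repeatable baseline before any parameter tuning.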

Figure 4. Ambari Manager Monitoring

Table 2. Software Revisions (As Tested)

Component                    Revision
Red Hat Enterprise Linux     6.4
Hortonworks Data Platform    1.3
Hadoop                       1.2.0

Table 3. PowerEdge R720 Infrastructure Node Configuration (As Tested)

Component          Specification
Height             2 Rack Units (3.5")
Processor          2x Intel Xeon E5-2650 2 GHz 8-core processors
Memory             128 GB
Disk               6x 600 GB 15K SAS drives
Network            4x 1GbE Intel LOMs, 2x 10GbE Intel NICs
RAID Controller    PowerEdge RAID Controller H710 (PERC)
Management Card    Integrated Dell Remote Access Controller (iDRAC)

Table 4. PowerEdge R720XD Data Node Configuration (As Tested)

Component          Specification
Height             2 Rack Units (3.5")
Processor          2x Intel Xeon E5-2667 2.9 GHz 6-core processors
Memory             64 GB
Disk               24x 500 GB or 1 TB 7200 RPM Nearline SAS drives
Network            4x 1GbE Intel LOMs, 2x 10GbE Intel NICs
RAID Controller    PowerEdge RAID Controller H710 (PERC)
Management Card    Integrated Dell Remote Access Controller (iDRAC)
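The data node figures in Table 4 can be related to the cores-to-spindles guidance and to cluster capacity with a short calculation. This is a back-of-the-envelope sketch, not a sizing method from the document: it counts physical cores only, treats all 24 drives as HDFS data disks, and assumes a replication factor of 3 (the HDFS default, which the document does not mention). Function names are hypothetical.

```python
def cores_per_spindle(sockets, cores_per_socket, data_disks):
    """Physical CPU cores available per data disk on one node."""
    return (sockets * cores_per_socket) / data_disks


def raw_capacity_tb(data_nodes, disks_per_node, disk_tb):
    """Total raw disk capacity across all data nodes, in TB."""
    return data_nodes * disks_per_node * disk_tb


def usable_capacity_tb(raw_tb, replication=3):
    """Rough HDFS capacity after replication; replication=3 is an assumption."""
    return raw_tb / replication


if __name__ == "__main__":
    # Table 4 data node: 2 sockets x 6 cores, 24 drives.
    print("cores per spindle:", cores_per_spindle(2, 6, 24))

    # Recommended small cluster from Table 1: 6 data nodes, here with 1 TB drives.
    raw = raw_capacity_tb(data_nodes=6, disks_per_node=24, disk_tb=1.0)
    print("raw TB:", raw)
    print("usable TB at 3x replication:", usable_capacity_tb(raw))
```

The usable figure is an upper bound, since it ignores space reserved for the operating system, temporary MapReduce output, and file system overhead.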

Dell Solution Centers

The Dell Solution Centers are a global network of connected labs that allow Dell to help customers architect, validate, and build solutions. With multiple footprints in every region, they help customers understand anything from simple hardware platforms to more complex solutions. These engagements range from an informal 30-60 minute briefing, through a longer half-day workshop, to a proof-of-concept that allows customers to kick the tires of their solution prior to signing on the dotted line. Customers may engage with their account team and have them submit a request to take advantage of these free services.

Links

Hortonworks: http://hortonworks.com
Hortonworks Data Platform: http://hortonworks.com/products/hdp/
Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/