Deploying Hadoop on Lustre Storage:




Deploying Hadoop on Lustre Storage: Lessons learned and best practices. J. Mario Gallegos (Dell), Zhiqi Tao (Intel) and Quy Ta (Dell). April 15, 2015

Integration of HPC workloads and Big Data workloads on Tulane's new Cypress HPC cluster

Big Data problems are large, and HPC resources can help: HPC brings more compute power, and HPC storage can provide more capacity and data throughput.

Requisites/constraints:
- HPC compute nodes have local storage space only for the OS.
- Integration with non-Hadoop applications and storage: need something that is compatible with POSIX.
- MapReduce should access storage efficiently: fast storage access (40 GbE) and fully supported.
- The solution had to be designed, integrated, deployed and tested in 7 weekdays to be ready for SC14.

Tulane University Cypress HPC hybrid cluster

Why Hadoop Map/Reduce on Lustre? Effective data processing (Hadoop applications, Map/Reduce) combined with high-performance storage (Lustre): management, visibility, scalability and performance.

Contrasting HDFS vs. Lustre (Hadoop = MapReduce + HDFS)

Lustre:
- Computations share the data
- Uses LNET to access data
- No data replication (uses RAID)
- Centralized storage
- POSIX compliant
- Widely used for HPC applications
- All data available at all times; allows backup/recovery using existing infrastructure (HSM)

HDFS:
- Data moves to the computation
- Uses HTTP for moving data
- Data replication (3x typical)
- Local storage
- Non-POSIX compliant
- Widely used for MapReduce applications
- Used during the MapReduce shuffle phase

Integration of HPC workloads and Big Data workloads on Tulane's new HPC cluster

MapReduce workloads added to HPC:
- Unknown mix of Hadoop and HPC applications from multiple fields, changing over time.
- Hadoop workloads run differently than typical HPC applications.
- Hadoop and HPC applications use different schedulers.
- Need to run MapReduce workloads just like HPC workloads.

More requisites/constraints:
- Train or hire admins to manage Hadoop; Lustre also requires training or hiring admins.
- OR simplify/automate/assist deployment, administration and monitoring to avoid/minimize training/hiring.

Design and initial implementation

Rely on partners and existing products to meet the deadline:
- Reuse existing Dell Lustre-based storage solutions. A Lustre training system was appropriated for the POC; the storage system was fully racked, deployed, tested and shipped. The plan was to connect power and 40 GbE and add Hadoop onsite. A box with all types of cables, spare adapters, other adapters and tools was included.
- Intel was chosen as the partner for Lustre. Intel EE for Lustre was already used in a Lustre solution and included the components needed to replace HDFS; a replacement for YARN was in the making.
- Bright was chosen for cluster/storage management. It was already used to deploy/manage/monitor the Cypress HPC cluster, supports the Cloudera distribution of Hadoop, and supports deployment of Lustre clients on Dell's clusters.

Proof of Concept components: Intel Enterprise Edition for Lustre (servers configured for failover/HA, active/passive)
- Management Server: R620, E5-2660v2, 128 GiB DDR3 1866 MT/s
- Object Storage Servers #1 and #2: R620, E5-2660v2, 128 GiB DDR3 1866 MT/s each
- Metadata Servers #1 and #2: R620, E5-2660v2, 128 GiB DDR3 1866 MT/s each
Storage arrays (6 Gb/s SAS attached):
- MD3260 with 60 x 4 TB NL-SAS drives
- MD3260 with 60 x 3 TB NL-SAS drives
- MD3220 with 24 x 300 GB 15K SAS drives
- Two MD3060e enclosures with 60 x 3 TB NL-SAS drives each

Proof of Concept setup (diagram): metadata servers holding the Management Target (MGT) and Metadata Target (MDT); object storage servers holding the Object Storage Targets (OSTs); a management network; a management server running Bright 7.0, CDH 5.1.2 and the Hadoop adapter for Lustre; 10 GbE and 40 GbE links connecting the storage to the headnode of Tulane's Cypress HPC cluster and its HPC/Hadoop compute nodes.

Hadoop Adapter for Lustre (HAL)
- Replaces HDFS.
- Based on the Hadoop architecture; packaged as a single Java library (JAR).
- Classes for accessing data on Lustre in a Hadoop-compliant manner; users can configure Lustre striping.
- Classes for null shuffle, i.e., shuffle with zero-copy.
- Easily deployable with minimal changes in the Hadoop configuration (see the sketch below); no change in the way jobs are submitted.
- Part of Intel Enterprise Edition for Lustre.
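To make "minimal changes in the Hadoop configuration" concrete, here is a minimal client-side sketch, assuming the lustre:/// scheme and the LustreFileSystem class named on the following slides; the exact property keys, package and class names shipped with HAL are assumptions, not taken from the product documentation.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LustreFsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Point the default file system at Lustre instead of HDFS.
        // "lustre:///" is the URL scheme described later in this deck.
        conf.set("fs.defaultFS", "lustre:///");

        // Map the "lustre" scheme to the adapter's FileSystem class, following
        // Hadoop's fs.<scheme>.impl convention (property and class names assumed).
        conf.set("fs.lustre.impl", "org.apache.hadoop.fs.LustreFileSystem");

        // With this configuration, MapReduce jobs read and write Lustre through
        // the standard FileSystem API; jobs are submitted exactly as before.
        FileSystem fs = FileSystem.get(URI.create("lustre:///"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}
```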

HPC Adapter for MapReduce (HAM)
- Replaces YARN (Yet Another Resource Negotiator).
- Based on SLURM (Simple Linux Utility for Resource Management), a widely used open-source resource manager.
- Objectives:
  - No modifications to Hadoop or its APIs.
  - Enable all Hadoop applications to execute without modification.
  - Maintain license separation.
  - Fully and transparently share HPC resources.
  - Improve performance.
- New to Intel Enterprise Edition for Lustre.

Hadoop MapReduce on Lustre. How?
Class hierarchy: org.apache.hadoop.fs.FileSystem -> RawLocalFileSystem -> LustreFileSystem.
- Use Hadoop's built-in LocalFileSystem/RawLocalFileSystem classes to add Lustre file system support (native file system support in Java).
- Extend and override the default behavior in LustreFileSystem (a sketch follows below):
  - Defines a new URL scheme for Lustre, i.e. lustre:///
  - Controls Lustre striping info.
  - Resolves absolute paths to user-defined directories.
- Optimize the shuffle phase for a performance improvement.
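A minimal sketch of the "extend and override" idea described above, assuming a POSIX Lustre client mount at /mnt/lustre. The class name, the lustre.fs.root property and the mount point are illustrative assumptions; the real HAL class also handles striping control and the zero-copy shuffle, which are not shown here.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;

/**
 * Because Lustre is POSIX, Hadoop's RawLocalFileSystem already knows how to
 * read and write it; a Lustre file system class mostly needs to register a
 * lustre:/// scheme and resolve absolute paths into the Lustre mount point.
 */
public class LustreFileSystemSketch extends RawLocalFileSystem {

    private Path lustreRoot;

    @Override
    public URI getUri() {
        // Advertise the lustre:/// scheme instead of file:///.
        return URI.create("lustre:///");
    }

    @Override
    public void initialize(URI uri, Configuration conf) throws IOException {
        super.initialize(uri, conf);
        // Assumed property name and default mount point, for illustration only.
        lustreRoot = new Path(conf.get("lustre.fs.root", "/mnt/lustre"));
    }

    // Resolve paths such as lustre:///user/alice to /mnt/lustre/user/alice,
    // which is visible under the same mount point on every compute node.
    private Path remap(Path p) {
        String raw = Path.getPathWithoutSchemeAndAuthority(p).toString();
        if (raw.isEmpty() || raw.equals("/")) {
            return lustreRoot;
        }
        return new Path(lustreRoot, raw.startsWith("/") ? raw.substring(1) : raw);
    }

    @Override
    public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        // The remaining FileSystem operations (create, delete, listStatus, ...)
        // would be remapped the same way in a complete implementation.
        return super.open(remap(f), bufferSize);
    }
}
```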

Hadoop Nodes Management
- Bright 7.0 was used to provision the 124 Dell C8220X compute nodes from bare metal.
- It was used to deploy CDH 5.1.2 onto eight Hadoop nodes, not running HDFS.
- The Lustre clients and the IEEL plug-in for Hadoop were deployed by Bright on the Hadoop nodes.
- The Hadoop nodes could then access Lustre storage.
- A different mix of Hadoop or HPC nodes can be deployed on demand using Bright.

On-site work and challenges
- Dell, Intel and Bright sent people onsite for the last 5 days and allocated resources to support remotely.
- The rack with the POC system took longer to arrive than planned.
- The location assigned to the POC system was farther from the HPC cluster than anticipated, and the 40 GbE cables were too short. New requirement: relocate the POC system closer to the cluster.
- The Z9500 40 GbE switch had only one open port; the POC system needed four. A breakout cable and four 10 GbE cables were used, the Z9500 port had to be configured for breakout cabling, and the IB FDR/40 GbE Mellanox adapters were replaced by 10 GbE adapters.
- After connecting the POC system, one of the servers had an iDRAC failure (everything was working during the initial testing in our lab); iDRAC is required for HA, to enforce node eviction. A replacement system was expedited, but the ETA was one day. We decided to swap the management server and the affected system, since all servers were R620s with the same CPU/RAM/local disk. The POC system was redeployed from scratch (OS and Intel software).

On-site work and challenges (continued)
- Local and remote resources started working immediately on Hadoop integration and client deployment, testing Lustre concurrently.
- Since the POC had to use only 10 GbE, performance was limited to a maximum of 2 GB/s (2 OSS links).
- The focus changed to demonstrating feasibility, capabilities and GUI monitoring, but performance was also reported.
- All on-site work was completed in four days, with the last day to spare.
- The support of Tulane University executives and staff was invaluable in adjusting to the initial problems and expediting their solution.

POC Performance
- Performance was reported without tuning; there was plenty of room for improvement, starting with using 40 GbE as originally planned.
- DFSio reported an average of 134.91 MB/s per file (12 files, 122,880 MB); Lustre reported a maximum aggregated read/write throughput of 1.6 GB/s, consistent with roughly 12 x 134.91 MB/s.
- TeraSort on 100 million records:
  - Total time spent by all maps in occupied slots (ms) = 341384
  - Total time spent by all reduces in occupied slots (ms) = 183595
  - Total time spent by all map tasks (ms) = 341384
  - Total time spent by all reduce tasks (ms) = 183595
  - Total vcore-seconds taken by all map tasks = 341384
  - Total vcore-seconds taken by all reduce tasks = 183595
  - Total megabyte-seconds taken by all map tasks = 349577216
  - Total megabyte-seconds taken by all reduce tasks = 188001280
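For reference, a hypothetical driver reproducing the benchmark flow of this slide: generate 100 million records with TeraGen, then sort them with TeraSort, reading and writing through the lustre:/// scheme sketched earlier. The benchmark paths are assumptions; in practice these tools are usually launched from the hadoop-mapreduce-examples JAR on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

public class TeraSortOnLustre {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // 100,000,000 rows of 100 bytes each, i.e. roughly 10 GB of input,
        // written to an assumed benchmark directory on Lustre.
        ToolRunner.run(conf, new TeraGen(),
                new String[] {"100000000", "lustre:///benchmarks/terasort-in"});

        // Sort the generated data; job counters such as the map/reduce task
        // times reported on this slide are printed when the job completes.
        ToolRunner.run(conf, new TeraSort(),
                new String[] {"lustre:///benchmarks/terasort-in",
                              "lustre:///benchmarks/terasort-out"});
    }
}
```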

Hadoop on Lustre Monitoring

Lessons learned and best practices
Keeping in mind the extremely aggressive schedule:
- Allocating the right resources, and enough of them, in advance was key under time constraints. Without our partners' full support, this POC could not have happened.
- Reuse of existing technology (solution components) and equipment was crucial.
- Gather in advance all possible information about existing equipment, power and networks on site, in as much detail as possible.
- Get detailed agreement in advance about the location for new systems and their requirements; use figures and photographs to clarify everything.
- Murphy never takes a break, and hardware may fail at any time, even if it was just tested.

Lessons learned and best practices (continued)
- Get user accounts, lab access, remote access, etc. in advance and, if possible, make the lab decision-makers aware of the work.
- Conference calls with remote access to show and sync all parties involved are a necessity. With large time-zone disparities, flexibility is gold.
- Have enough spare parts: adapters, hard disks, cables and tools; you may still need more.
- Using interchangeable servers proved to be an excellent choice.
- Division of assignments by competence and remote support: priceless.
- A single chain of command for assignments and frequent status updates were very valuable.
- Communicate, communicate, communicate!

Acknowledgements
Dell: Joe Stanfield, Onur Celebioglu, Nishanth Dandapanthu, Jim Valdes, Brent Strickland.
Intel: Karl Cain.
Bright Computing: Robert Stober, Michele Lamarca, Martijn de Vries.
Tulane: Charlie McMahon, Leo Tran, Tim Deeves, Olivia Mitchell, Tim Riley, Michael Lambert.

Questions?

Thank you! www.dellhpcsolutions.com www.hpcatdell.com
