Hadoop on the Gordon Data Intensive Cluster


Amit Majumdar, Scientific Computing Applications
Mahidhar Tatineni, HPC User Services
San Diego Supercomputer Center, University of California San Diego
Dec 18, 2012, WBDB2012.in

Gordon: An Innovative Data-Intensive Supercomputer
Designed to accelerate access to massive amounts of data in areas of genomics, earth science, engineering, medicine, and others
Emphasizes memory and I/O over FLOPS
Appro-integrated 1,024-node Sandy Bridge cluster
64 GB per node with 16 Sandy Bridge cores
300 TB of high-performance Intel flash
Large-memory supernodes via vSMP Foundation from ScaleMP
3D torus interconnect from Mellanox
In production operation since February 2012
Funded by the NSF and available through the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program

Gordon Design: Two Driving Ideas
Observation #1: Data keeps getting further away from processor cores ("red shift")
  Do we need a new level in the memory hierarchy?
Observation #2: Many data-intensive applications are serial and difficult to parallelize
  Would a large, shared-memory machine be better from the standpoint of researcher productivity for some of these?
  Rapid prototyping of new approaches to data analysis

The Memory Hierarchy of a Typical Supercomputer
[Figure labels: shared-memory programming (single node); message-passing programming; latency gap; big data; disk I/O]

The Memory Hierarchy of Gordon
[Figure labels: shared-memory programming (vSMP); big data; disk I/O]

Gordon Design Highlights
1,024 2S Xeon E5 (Sandy Bridge) compute nodes: 16 cores, 64 GB/node; Intel Jefferson Pass motherboard; PCI Gen3
3D torus, dual-rail QDR
300 GB Intel 710 eMLC SSDs, 300 TB aggregate
64 2S Westmere I/O nodes: 12 cores, 48 GB/node; 4 LSI controllers; 16 SSDs; dual 10GbE; SuperMicro motherboard; PCI Gen2
Large-memory vSMP supernodes: 2 TB DRAM, 10 TB flash
Data Oasis Lustre PFS: 100 GB/s, 4 PB

(Some) SSDs are a good fit for data-intensive computing

                                     Flash drive       Typical HDD     Good for data-intensive apps
Latency                              < 0.1 ms          10 ms
Bandwidth (r/w)                      270 / 210 MB/s    100-150 MB/s
IOPS (r/w)                           38,500 / 2,000    100
Power consumption (when doing r/w)   2-5 W             6-10 W
Price/GB                             $3/GB             $0.50/GB
Endurance                            2-10 PB           N/A
Total cost of ownership              Jury is still out.

Hadoop Overview
The Hadoop framework is extensively used for scalable distributed processing of large datasets, and there is high interest from the user base.
The Gordon architecture presents potential performance benefits with high-performance storage (SSDs) and a high-speed network (QDR IB).
Storage options viable for Hadoop on Gordon:
  SSD via iSCSI Extensions for RDMA (iSER)
  Lustre filesystem (Data Oasis), persistent storage
Approach to running Hadoop within the scheduler infrastructure: myHadoop (developed at SDSC).

Overview (Contd.)
Network options:
  1 GigE: low performance
  IPoIB
Results:
  DFS benchmark runs done on both SSD and Lustre.
  TeraSort scaling study with sizes from 100 GB to 1.6 TB on 4 to 128 nodes.
Future/ongoing work:
  UDA, the Unstructured Data Accelerator from Mellanox: http://www.mellanox.com/related-docs/applications/sb_hadoop.pdf
  Lustre for Hadoop: can provide another persistent data option.

Gordon System Specification

INTEL SANDY BRIDGE COMPUTE NODE
  Sockets                                          2
  Cores                                            16
  Clock speed                                      2.6 GHz
  DIMM slots per socket                            4
  DRAM capacity                                    64 GB
INTEL FLASH I/O NODE
  NAND flash SSD drives                            16
  SSD capacity per drive / per node / total        300 GB / 4.8 TB / 300 TB
  Flash bandwidth per drive (read/write)           270 MB/s / 210 MB/s
  Flash bandwidth per node (read/write)            4.3 GB/s / 3.3 GB/s
SMP SUPER-NODE
  Compute nodes                                    32
  I/O nodes                                        2
  Addressable DRAM                                 2 TB
  Addressable memory including flash               12 TB
GORDON
  Compute nodes                                    1,024
  Total compute cores                              16,384
  Peak performance                                 341 TF
  Aggregate memory                                 64 TB
INFINIBAND INTERCONNECT
  Aggregate torus BW                               9.2 TB/s
  Type                                             Dual-rail QDR InfiniBand
  Link bandwidth                                   8 GB/s (bidirectional)
  Latency (min-max)                                1.25 µs - 2.5 µs
DISK I/O SUBSYSTEM
  Total storage                                    4.5 PB (raw)
  I/O bandwidth                                    100 GB/s
  File system                                      Lustre
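
Several of the aggregate figures in the specification follow from the per-node numbers. The sketch below recomputes them as a sanity check. The inputs are taken from the slides (the count of 64 I/O nodes comes from the design highlights); the value of 8 double-precision floating-point operations per core per cycle is an assumption about Sandy Bridge with AVX and is not stated in the talk.

```python
# Inputs copied from the slides, except FLOPS_PER_CYCLE (see below).
COMPUTE_NODES = 1024
CORES_PER_NODE = 16
CLOCK_GHZ = 2.6
DRAM_PER_NODE_GB = 64
IO_NODES = 64
SSDS_PER_IO_NODE = 16
SSD_CAPACITY_GB = 300
SSD_READ_MB_PER_S = 270
SSD_WRITE_MB_PER_S = 210

# ASSUMPTION (not in the slides): 8 double-precision FLOPs per core per cycle.
FLOPS_PER_CYCLE = 8


def derived_figures():
    """Recompute the aggregate numbers from the per-node inputs."""
    cores = COMPUTE_NODES * CORES_PER_NODE
    return {
        "total_cores": cores,
        # GHz * FLOPs/cycle gives GFLOP/s per core; divide by 1000 for TFLOP/s.
        "peak_tflops": cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000,
        # Memory is counted in binary units (1 TB = 1024 GB).
        "aggregate_memory_tb": COMPUTE_NODES * DRAM_PER_NODE_GB / 1024,
        # Flash is counted in decimal units (1 TB = 1000 GB).
        "flash_per_io_node_tb": SSDS_PER_IO_NODE * SSD_CAPACITY_GB / 1000,
        "flash_total_tb": IO_NODES * SSDS_PER_IO_NODE * SSD_CAPACITY_GB / 1000,
        "flash_read_gb_per_s_per_node": SSDS_PER_IO_NODE * SSD_READ_MB_PER_S / 1000,
        "flash_write_gb_per_s_per_node": SSDS_PER_IO_NODE * SSD_WRITE_MB_PER_S / 1000,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:g}")
```

Under these assumptions the peak comes out near 341 TF, and the total flash capacity comes out slightly above the rounded 300 TB quoted in the table.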

Subrack Level Architecture
16 compute nodes per subrack

Primary Storage Option for Hadoop: One SSD per Compute Node
[Figure: logical view, only 4 of 16 compute nodes shown. Labels: Lustre; compute nodes; one SSD per compute node; file system appears as /scratch/$user/$pbs_jobid]
The SSD drives on each I/O node are exported to the compute nodes via the iSCSI Extensions for RDMA (iSER).
One 300 GB flash drive exported to each compute node appears as a local file system.
For the Hadoop cluster, each datanode has 300 GB of disk available.
Each compute node has 2 IB connections: iSER and Lustre use rail 1; the Hadoop network (IPoIB or UDA) uses rail 0.
The Lustre parallel file system is mounted identically on all nodes, with an aggregate BW of 100 GB/s.
This exported SSD storage can serve as the local storage used by HDFS.
Network BW over rail 1 and the aggregate I/O BW from the 16 SSD drives are well matched.
Physically locating the flash drives in the I/O node creates no network bottleneck.
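
The "well matched" claim can be checked with a little arithmetic. The sketch below compares the aggregate bandwidth of the 16 SSDs in one I/O node with the one-way bandwidth of a single QDR link. It assumes that one rail-1 link per I/O node carries the iSER traffic and that the 8 GB/s bidirectional link figure from the system specification splits evenly into 4 GB/s per direction; neither assumption is stated explicitly in the slides.

```python
def bandwidth_match(ssds=16, ssd_read_mb_per_s=270, ssd_write_mb_per_s=210,
                    link_bidirectional_gb_per_s=8.0):
    """Compare aggregate SSD bandwidth with one direction of one QDR link."""
    # ASSUMPTION: the bidirectional figure splits evenly between directions.
    link_one_way = link_bidirectional_gb_per_s / 2
    ssd_read = ssds * ssd_read_mb_per_s / 1000    # GB/s, all drives reading
    ssd_write = ssds * ssd_write_mb_per_s / 1000  # GB/s, all drives writing
    return {
        "link_one_way_gb_per_s": link_one_way,
        "ssd_read_gb_per_s": ssd_read,
        "ssd_write_gb_per_s": ssd_write,
        # A ratio close to 1.0 means neither side is heavily over-provisioned.
        "read_to_link_ratio": ssd_read / link_one_way,
        "write_to_link_ratio": ssd_write / link_one_way,
    }


if __name__ == "__main__":
    for name, value in bandwidth_match().items():
        print(f"{name}: {value:.2f}")
```

With the slide values, both ratios land close to 1, which is consistent with the statement that the drives and the network are well matched.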

3D Torus of Switches
Linearly expandable
Simple wiring pattern
Short cables; fiber-optic cables generally not required
Lower cost: 40% as many switches, 25% to 50% fewer cables
Works well for localized communication
Fault tolerant within the mesh with 2QoS alternate routing
Fault tolerant with dual rails for all routing algorithms
(3rd-dimension wrap-around not shown for clarity)

Gordon Network Architecture
[Figure: network diagram. Components: XSEDE and R&E networks; SDSC network; data movers (4x); management nodes (2x); login nodes (4x, round-robin login); NFS servers (4x, mirrored NFS, redundant front end); compute nodes (1,024); I/O nodes (64); Data Oasis Lustre PFS (4 PB); management and public edge and core Ethernet. Links: dual-rail IB 3D torus (rail 1 and rail 2, QDR 40 Gb/s); dual 10GbE storage (2x10GbE); GbE management; GbE public.]

myHadoop Integration with Gordon Scheduler
myHadoop* was developed at SDSC to help run Hadoop within the scope of the normal scheduler.
Scripts are available to use the node list from the scheduler and dynamically change the Hadoop configuration files:
  mapred-site.xml
  masters
  slaves
  core-site.xml
Users can make additional Hadoop cluster changes if needed.
Nodes are exclusive to the job (this does not affect the rest of the cluster).
*myHadoop link: http://sourceforge.net/projects/myhadoop/
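
To make the idea concrete, here is a minimal sketch of turning a scheduler node list into the four files named above. It is not the myHadoop code. It assumes a PBS/Torque-style node file (path in the PBS_NODEFILE environment variable, one line per allocated core), uses the first host as the master, and picks port numbers and a "hadoop-conf" output directory arbitrarily. The property names fs.default.name, hadoop.tmp.dir, and mapred.job.tracker are the Hadoop 1.x names.

```python
import os
from pathlib import Path
from xml.sax.saxutils import escape


def unique_hosts(nodefile_text):
    """Node files repeat a host once per core; keep the first occurrence of each."""
    hosts = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def site_xml(props):
    """Render a dict as a Hadoop <configuration> document."""
    body = "".join(
        f"  <property><name>{escape(k)}</name><value>{escape(str(v))}</value></property>\n"
        for k, v in props.items()
    )
    return f'<?xml version="1.0"?>\n<configuration>\n{body}</configuration>\n'


def write_hadoop_conf(hosts, conf_dir, data_dir,
                      namenode_port=54310, jobtracker_port=54311):
    """Write masters, slaves, core-site.xml and mapred-site.xml for one job.

    The port numbers are illustrative assumptions, not values from the talk.
    """
    master = hosts[0]
    conf = Path(conf_dir)
    conf.mkdir(parents=True, exist_ok=True)
    (conf / "masters").write_text(master + "\n")
    (conf / "slaves").write_text("\n".join(hosts) + "\n")
    (conf / "core-site.xml").write_text(site_xml({
        "fs.default.name": f"hdfs://{master}:{namenode_port}",
        "hadoop.tmp.dir": data_dir,  # e.g. the job-specific SSD scratch path
    }))
    (conf / "mapred-site.xml").write_text(site_xml({
        "mapred.job.tracker": f"{master}:{jobtracker_port}",
    }))


if __name__ == "__main__" and "PBS_NODEFILE" in os.environ:
    nodes = unique_hosts(Path(os.environ["PBS_NODEFILE"]).read_text())
    write_hadoop_conf(nodes, "hadoop-conf", "/scratch/example")  # hypothetical paths
```

Because the files are regenerated from the node list for each job, the same batch script works regardless of which nodes the scheduler assigns.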

Hadoop on Gordon: Storage Options
The primary storage option is to use SSDs, which are available via iSCSI Extensions for RDMA (iSER) on each compute node.
The prolog creates a job-specific SSD filesystem location that can be used for HDFS storage. The myHadoop framework is leveraged to do this dynamically.
Lustre storage is available as a persistent option. DFS benchmark results show very good performance, comparable to SSD storage.

Hadoop on Gordon: Network Options
All Gordon compute nodes are dual QDR InfiniBand connected. Additionally, a 1 GigE connection is available.
For a standard Apache Hadoop install, IPoIB using one of the default QDR links is the best network option. The myHadoop setup uses the ib0 addresses in the dynamic configuration setup.
Mellanox's Unstructured Data Accelerator (UDA) is another option, providing the middleware interface directly to the high-speed IB link. We are currently testing this on Gordon.

Hadoop on Gordon: TestDFSIO Results
[Figure: average I/O (MB/s) versus file size (GB) for Ethernet (1 GigE), IPoIB, and UDA.]
Results from the TestDFSIO benchmark. Runs were conducted using 4 datanodes and 2 map tasks. Average write I/O rates are presented.

Hadoop on Gordon: TeraSort Results
[Figure: two panels, (a) and (b).]
Results from the TeraSort benchmark. Data is generated using teragen and is then sorted. Two sets of runs were performed using IPoIB and SSD storage: (a) a fixed dataset size of 100 GB, with the number of nodes varied from 4 to 32; (b) dataset size increasing with the number of nodes.
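
Because teragen takes a row count rather than a byte size, and each row is 100 bytes, a dataset size has to be converted before a run. The sketch below does that conversion and builds the two commands. It assumes decimal gigabytes (1 GB = 10^9 bytes); the jar path and HDFS directory names are placeholders.

```python
TERAGEN_ROW_BYTES = 100  # teragen writes fixed-size 100-byte rows


def teragen_rows(dataset_gb):
    """Number of rows for a dataset of the given size in decimal GB."""
    return dataset_gb * 10**9 // TERAGEN_ROW_BYTES


def terasort_commands(examples_jar, dataset_gb, in_dir, out_dir):
    """Return the teragen and terasort argv lists for one dataset size."""
    rows = teragen_rows(dataset_gb)
    return [
        ["hadoop", "jar", examples_jar, "teragen", str(rows), in_dir],
        ["hadoop", "jar", examples_jar, "terasort", in_dir, out_dir],
    ]


if __name__ == "__main__":
    # ASSUMPTIONS: jar name and directory names are illustrative only.
    for size_gb in (100, 1600):  # the smallest and largest sizes mentioned
        for cmd in terasort_commands("hadoop-examples-1.0.4.jar", size_gb,
                                     f"tera-in-{size_gb}", f"tera-out-{size_gb}"):
            print(" ".join(cmd))
```

Under this convention the 100 GB case is one billion rows and the 1.6 TB case is sixteen billion rows.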

Preliminary Hadoop on Gordon Tuning
Based on the "Optimizing Hadoop Deployments" whitepaper by Intel, October 2010, version 2.0.
The system is managed using the Rocks cluster management software, with a CentOS distribution (kernel version 2.6.34.7-1) and a custom Mellanox OFED stack (v. 1.5.3) designed to enable the iSER export of SSD resources.
The Hadoop cluster tests were done using Apache Hadoop version 1.0.4.
The following three network options were used for the Hadoop cluster: 1) 1 GigE Ethernet, 2) IPoIB using one InfiniBand rail, and 3) UDA version 3.0.
In addition to the storage and network options, tuning of HDFS parameters (dfs.block.size), map/reduce parameters (io.sort.factor, io.sort.mb, io.sort.record, etc.), and general configuration parameters (dfs.namenode.handler.count, mapred.job.tracker.handler.count, etc.) was also considered.
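
As an illustration of where such parameters go, the sketch below renders a few of the named properties into Hadoop configuration XML, grouped by the file they conventionally live in for Hadoop 1.x. Every value shown is a placeholder chosen for the example; the talk does not report the settings that were actually used.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: all values below are illustrative placeholders, not tuned settings.
HDFS_SITE = {
    "dfs.block.size": 134217728,          # 128 MB, in bytes
    "dfs.namenode.handler.count": 64,
}
MAPRED_SITE = {
    "io.sort.factor": 64,
    "io.sort.mb": 256,
    "mapred.job.tracker.handler.count": 64,
}


def to_configuration_xml(props):
    """Render a dict as a <configuration> element with one <property> per entry."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print("hdfs-site.xml:", to_configuration_xml(HDFS_SITE))
    print("mapred-site.xml:", to_configuration_xml(MAPRED_SITE))
```

Keeping the parameters in plain dictionaries makes it straightforward to sweep one value at a time and regenerate the files between benchmark runs.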

Summary
Hadoop can be dynamically configured and run on Gordon using the myHadoop framework.
The high-speed InfiniBand network allows for significant gains in performance using SSDs.
The TeraSort benchmark shows good performance and scaling on Gordon.
Future improvements/tests include:
  using Mellanox UDA for the TeraSort benchmark
  using the Lustre filesystem for HDFS storage