High Performance NAS for Hadoop

HPC Advisory Council, Stanford, Feb 8, 2013
Dr. Brent Welch, CTO, Panasas
Panasas and Hadoop

PANASAS TECHNICAL DIFFERENTIATION
- Scalable Performance
  - Balanced object-storage building block [8TB SATA, 120GB SSD, 8GB RAM, 1 core, dual GE]
  - 40 TB to 8 PB single system supporting 100s to 1000s of active clients
- Novel Integrity Protection
  - File system and RAID are integrated
  - Highly reliable data w/ novel data protection systems
- Maximum Availability
  - Built-in distributed system platform manages 100s of blades
- Simple to Deploy and Maintain
  - Integrated storage system with appliance model
- Application Acceleration
  - Customer proven results
- Standards Based
  - pNFS, OSD
[Image: ActiveStor 14]

ACTIVESTOR BLADE HARDWARE
[Image labels: Dual Power Supplies + Battery; 4U; Dual 10GE uplinks; Scalable Metadata; Enterprise SATA + SSD => OSD]

PANASAS SYSTEM VIEW
- Complete appliance solution (HW + SW), blade form factor
  - DirectorBlade = metadata server
  - StorageBlade = OSD
- Clustered, fault tolerant metadata services
- Linux kernel module for parallel I/O: DirectFlow, or pNFS
- Object Storage
- Snapshots, Quota
- Global namespace
- NFS & CIFS re-export
[Diagram labels: NFS/CIFS Client; DirectorBlade; 100+; StorageBlade; 1000+; Nodes; RPC; SysMgr; PanFS Client; OSDFS; 10,000+; iSCSI/OSD]

PANASAS PARALLEL DATA PATH
- Data path bypasses RAID controllers and metadata servers
  - Application writes data
  - DirectFlow/pNFS client layer generates redundant data for each stripe
  - Everything is written directly to storage
- All blades work together on RAID rebuild
[Diagram: six clients attached to the storage over an Ethernet network]
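The client-side redundancy step on this slide can be illustrated with a minimal sketch. It assumes single-parity XOR over equal-sized stripe units (RAID-5 style); the slide does not say which RAID scheme DirectFlow actually uses, and all function names here are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data, unit_size, units_per_stripe):
    """Client side: split data into stripe units and append one parity unit.

    Assumption: one XOR parity unit per stripe; short data is zero-padded.
    """
    stripe_bytes = unit_size * units_per_stripe
    padded = data.ljust(stripe_bytes, b"\0")[:stripe_bytes]
    units = [padded[i:i + unit_size] for i in range(0, stripe_bytes, unit_size)]
    return units + [xor_blocks(units)]


def rebuild_unit(stripe, lost_index):
    """Recover one lost unit by XOR-ing all surviving units of the stripe."""
    survivors = [u for i, u in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    stripe = make_stripe(b"hello world!", unit_size=4, units_per_stripe=3)
    # Each unit (data and parity) would be written directly to a different OSD.
    print(rebuild_unit(stripe, 1) == stripe[1])
```

Because the client computes the parity itself, each unit can go straight to a storage blade without passing through a central RAID controller, which is the point the slide makes.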

PANASAS PARALLEL ADVANTAGE
- Scale-out storage system with true parallel architecture
  - Scale performance and capacity at the same time
  - Rapid recovery from failure: shared RAID responsibility
- 4 shelves are 4 times faster than 1
- 12 shelves rebuild 12 times faster than 1
[Chart: Shelf Scaling. MB/sec vs. IOR processes; series: Write 4 shelves 16 clients, Write 2 shelves 8 clients, Write 1 shelf 8 clients. 3.4 testing December 2008, PAS 8 10GE]
[Chart: Rebuild. MB/sec vs. # Shelves; series: One Volume, 1GB Files; One Volume, 100MB Files; N Volumes, 1GB Files; N Volumes, 100MB Files]

SCALABLE BANDWIDTH
- 8 shelves are 8 times faster than 1
[Chart: Shelf Scaling, Nov 2012, 5.0.0. MB/sec vs. # Shelves, 80 procs per shelf; series: Write Aggregate, Read Aggregate, Write Per Shelf, Read Per Shelf. Testing Nov 2012, AS-12 & AS-14, Rel 5.0.0]

HIGH PERFORMANCE NAS FOR HADOOP

HADOOP HW ENVIRONMENT
- Low cost hardware, run until failure, offline service
- Network infrastructure often oversubscribed

HADOOP SW ENVIRONMENT
- The Hadoop environment is an open Java implementation of a family of data and compute facilities
  - Hadoop job scheduler for Map/Reduce applications
  - HDFS file system
  - Zookeeper configuration management
  - NoSQL key-value stores layered over HDFS
  - Query languages
  - Many more

LIMITATIONS OF THE ENVIRONMENT
- Classic HW config mixes compute and data, with a weak network
  - Motivates function shipping instead of data shipping
  - Even so, local access to data is not always possible
- Triplication is an expensive way to do data protection
- Not easy to share HDFS data with normal applications
- Classic model grew up in an environment skewed by Google requirements
  - Very different from the classic HPC environment

DEDICATED COMPUTE AND STORAGE
- Separating compute and storage demands a high quality network
- Data is shared among different compute clusters
- Hardware replacement cycles for compute and storage differ
[Diagram labels: Network; OSD (x10); NFS4.1; Metadata service]

HIGH PERFORMANCE NAS FOR HADOOP
- A fast network and a good, scalable parallel file system
  - Keep compute and data management separate
  - Mixed workflows with different kinds of applications sharing data
- Performance intuition
  - A local disk goes at 50 to 100 MB/sec (large sequential workloads)
  - A good network file system can deliver 500-1000+ MB/sec to one client
  - A local SSD can deliver 250 to 2500 MB/sec
  - Tuning Map/Reduce is more about partitioning a problem so it fits into main memory of the nodes
- Management intuition
  - Data scattered among compute nodes makes them heavy
  - Hard to upgrade compute w/out affecting storage
  - Serviceability model of many hard drives or an expensive PCIe card in every compute node is not very good
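The performance intuition above becomes concrete when the bandwidth figures are turned into scan times. This is a back-of-the-envelope sketch only: the 100 GB per-node input size is an arbitrary assumption, 1 GB is taken as 1000 MB, and the function name is hypothetical.

```python
def seconds_to_read(size_gb, mb_per_sec):
    """Time to stream size_gb sequentially at a sustained rate (1 GB = 1000 MB)."""
    return size_gb * 1000 / mb_per_sec


if __name__ == "__main__":
    # Bandwidth ranges quoted on the slide (MB/sec, low and high end).
    sources = {
        "local disk": (50, 100),
        "network file system, one client": (500, 1000),
        "local SSD": (250, 2500),
    }
    for name, (low, high) in sources.items():
        slow = seconds_to_read(100, low)
        fast = seconds_to_read(100, high)
        print(f"{name}: {fast:.0f} to {slow:.0f} s for 100 GB")
```

At these rates a single local disk needs 1000 to 2000 seconds for the assumed 100 GB, while a network file system at the quoted rates needs 100 to 200 seconds.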

COMPARING PANFS AND HDFS

                           | Hadoop                             | Panasas                         | Comment
Availability               | Triple Replication                 | Object RAID                     | Panasas at 15% overhead vs. 200%
File system support        | Proprietary                        | POSIX                           | Panasas files can be shared with other big data workloads
Hardware                   | Compute and Storage scale together | Compute and Storage independent | Panasas allows independent scaling of compute and storage
Applications               | Single task - Hadoop analytics     | Multi-purpose workloads         | Panasas designed for many big data workloads
Multi-client write to file | Not allowed - WORM                 | Supported - Write many          | Panasas big data workloads require concurrent file access by multiple clients
Small File                 | No                                 | Yes                             | Panasas well suited to mixed big data workloads
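The "15% overhead vs. 200%" comparison can be worked through as raw capacity needed for a given amount of user data. A minimal sketch: the 100 TB figure is an arbitrary assumption and the function name is hypothetical; only the two overhead percentages come from the slide.

```python
def raw_capacity_tb(usable_tb, overhead_percent):
    """Raw capacity required to hold usable_tb of data at a given protection overhead."""
    return usable_tb * (100 + overhead_percent) / 100


if __name__ == "__main__":
    usable = 100  # TB of user data (illustrative assumption)
    # Triple replication stores two extra copies: 200% overhead.
    print("HDFS triple replication:", raw_capacity_tb(usable, 200), "TB raw")
    # Object RAID overhead as quoted on the slide: 15%.
    print("Panasas Object RAID:", raw_capacity_tb(usable, 15), "TB raw")
```

For the assumed 100 TB of data this gives 300 TB of raw capacity under triple replication against 115 TB under Object RAID.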

ENTERPRISE HADOOP ENVIRONMENT
- Reliable, trusted enterprise storage
  - Panasas storage offers enterprise class features such as snapshots, user quotas, service and IT administration
  - Panasas allows users to scale computing and storage independently
  - Features such as load balancing ensure all nodes are equally capable of participating in data transfers
  - Storage can be added to a live system and dynamically integrated into the available pool
- Data management and data retention
  - Supports data migration; old data can be moved to archives
  - It can integrate with existing data management systems
  - Hadoop lacks any built-in data migration other than replicating the entire data set to another system
- Scalable storage performance
  - Tightly balanced system that scales performance linearly as more nodes are added to the system

USING NAS WITH HADOOP
- Can run on any distribution and any version (Cloudera, Hortonworks, Apache)
  - No updates required for newer versions of Hadoop
  - No need for proprietary software implementation
  - Simple configuration setup
- Can run on HDFS or run directly on PanFS
  - Layer HDFS over PanFS
    - Configure HDFS pathnames to use /panfs
    - URL: hdfs://panfs/system/workspace
  - Bypass HDFS entirely
    - Configure file:// URLs to use /panfs
    - URL: file://panfs/system/workspace
- Details captured in a white paper and configuration guide; visit www.panasas.com to get a copy of the paper
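As a rough illustration of the "bypass HDFS entirely" option, the sketch below generates a Hadoop-style site configuration fragment. Assumptions: `fs.defaultFS` is the standard Hadoop 2.x property for the default file system (`fs.default.name` in 1.x), but the slide does not say which property the Panasas configuration guide sets; the URL string is copied verbatim from the slide; the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> XML document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumption: the default file system is pointed at the PanFS mount,
# using the URL exactly as it appears on the slide.
PANFS_DIRECT = {"fs.defaultFS": "file://panfs/system/workspace"}

if __name__ == "__main__":
    print(hadoop_site_xml(PANFS_DIRECT))
```

The white paper and configuration guide mentioned on the slide remain the authoritative source for the actual property names and values.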

PERFORMANCE, HDFS OVER PANFS
- 41% faster than local disk on HDFS (1 copy)
- 29% faster than local disk on HDFS (2 copy)
- HDFS configured to store data into PanFS
- Equal # of disks
[Chart: seconds for TeraGen, TeraSort, and TeraValidate, Local Disk vs. ActiveStor 14T; bar labels 2,302 and 1,638]
- Download Panasas whitepaper for detailed setup and results: http://www.panasas.com/sites/default/files/uploads/docs/hadoop_wp_lr_1096.pdf
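The 41% figure can be checked against the two bar labels on the chart. This sketch assumes that 2,302 seconds is the Local Disk total and 1,638 seconds is the ActiveStor 14T total, which matches the order of the labels but is not stated explicitly; "faster" is read here as the ratio of completion times, and the function name is hypothetical.

```python
def percent_faster(baseline_seconds, new_seconds):
    """Speed gain of the new run over the baseline, as a percentage."""
    return (baseline_seconds / new_seconds - 1) * 100


if __name__ == "__main__":
    # Assumed mapping: Local Disk = 2302 s, ActiveStor 14T = 1638 s.
    print(f"{percent_faster(2302, 1638):.1f}% faster")
```

Under that reading the result is about 40.5%, which rounds to the 41% quoted on the slide.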

PERFORMANCE, HDFS VS PANFS
- Generate, Sort, and Validate 1TB of key/values
- Seconds to complete; lower is better
- HDFS: nodes use local disk; PanFS: nodes use PanFS
- HDFS: two-copy replication; PanFS: Object RAID
[Chart: seconds for TeraGen, TeraSort, and TeraValidate, HDFS vs. PanFS]

SUMMARY
- The decisions around the original Hadoop hardware platform were driven by dedicated, application-specific requirements
  - A direct-attach dedicated server cluster works when the data set is small or when the entire business revolves around Hadoop
  - Mixed-use environments, typical of the enterprise, require a system that has flexibility, high reliability, and enterprise fault tolerance, and that supports typical disaster recovery strategies
- Panasas network attached storage is a viable option for many big data workloads, including Hadoop analytics
- As networking continues to get faster and cheaper, networked storage will become an increasingly viable solution for Hadoop
  - Large data sets are unwieldy on local disk
  - Management headache of the 1990s in the enterprise again?
- Hadoop is first an application; the hardware choice depends on the business-specific context. Panasas NAS is a viable, high performance solution for mixed-use workloads

THANK YOU