An Alternative Storage Solution for MapReduce. Eric Lomascolo, Director, Solutions Marketing


MapReduce Breaks the Problem Down
- Data analysis: distributes processing work (Map) across compute nodes and accumulates results (Reduce)
- Hadoop is a popular open-source MapReduce software stack
- Processes unstructured and semi-structured data
- HDFS uses location information to replicate data between nodes; by default, 3 copies
*Hadoop Demystified, Rare Mile Technologies
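The Map/shuffle/Reduce flow described above can be sketched as a plain-Python word count (the function names are illustrative, not Hadoop's API):

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: accumulate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    lines = ["big data big storage", "big analytics"]
    print(reduce_phase(shuffle(map_phase(lines))))
    # {'big': 3, 'data': 1, 'storage': 1, 'analytics': 1}
```

In a real Hadoop cluster the map and reduce phases run on different nodes and the shuffle moves data over the network, which is exactly the data movement the later slides discuss.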

About the Hadoop File System (HDFS)
- WORM (write once, read many) access model
- Uses commodity hardware with the expectation that failures will occur
- Reads data in large, contiguous blocks and processes very large files
- Hardware agnostic
- Assumes that moving computation is cheaper than moving data

HDFS Performance Is Limited
- HDFS premise: moving computation is cheaper than moving data
- But the data always has to be moved, either from local disk or from the network, including replication operations for availability and results data movement
- With a good network, the network wins
- Hadoop performance is gated by file system performance

Hadoop File System (HDFS) Challenges
Performance
- No caching for random loads
- Slow file modifications due to WORM and synchronous replication
- HTTP is used for data transfer; cannot use DMA
Scalability
- Large block sizes limit the number of files
- Full use of resources is limited when data is not local to the CPU
- HDFS RAID can eliminate the need for replication but impacts the CPU
Storage
- Not POSIX compliant, and no general-purpose access
- Data transfer into and out of the Hadoop environment is required
- Data replication raises storage costs

Lustre High-Performance File System Alternative
[Architecture diagram: Lustre clients (1-100,000) and CIFS/NFS clients (via a gateway) connect over routers to Object Storage Servers (OSS, 1-1,000s) holding Object Storage Targets (OSTs) and to Metadata Servers (MDS) holding a Metadata Target (MDT), backed by disk arrays and a SAN fabric. Multiple network types are supported: Gemini, Myrinet, IB, GigE]

Comparing HDFS to Lustre: Cluster Setup Scenario
- 100 clients, 100 disks, InfiniBand
- Disks: 1 TB high-capacity SAS drives (Seagate Barracuda), 80 MB/s bandwidth with cache off
- Network: 4x SDR InfiniBand, 1 GB/s
- HDFS: 1 drive per client
- Lustre: 10 OSSs with 10 OSTs each

HDFS Setup
[Diagram: each client has a local disk (80 MB/s) and connects to an IB switch (1 GB/s)]

Lustre Setup
[Diagram: clients connect through an IB switch (1 GB/s) to OSSs, each serving multiple OSTs (80 MB/s per disk)]

Comparing HDFS to Lustre: Theoretical, Part I
- 100 clients, 100 disks, SDR InfiniBand
- HDFS: 1 drive per client; local client bandwidth is 80 MB/s
- Lustre: each OSS has 800 MB/s of aggregate bandwidth (80 MB/s × 10), assuming enough bus bandwidth to access all drives simultaneously; net bandwidth is 1 GB/s (IB is point to point)
- With 10 OSSs, we have the same capacity and bandwidth
- The network is not the limiting factor!
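The arithmetic above can be checked in a few lines (the per-disk and network figures are the deck's stated assumptions, not measurements):

```python
# Back-of-the-envelope check of the slide's numbers.
DISK_MBPS = 80        # one SAS drive, cache off
NETWORK_MBPS = 1000   # 4x SDR InfiniBand, ~1 GB/s point to point
DISKS_PER_OSS = 10
NUM_OSS = 10

oss_mbps = DISK_MBPS * DISKS_PER_OSS   # aggregate disk bandwidth per OSS
cluster_mbps = oss_mbps * NUM_OSS      # all OSSs together

print(f"Per-OSS disk bandwidth: {oss_mbps} MB/s")      # 800 MB/s
print(f"Per-OSS network limit:  {NETWORK_MBPS} MB/s")  # disks saturate before the link does
print(f"Cluster aggregate:      {cluster_mbps} MB/s")  # 8000 MB/s, same as 100 local HDFS disks
```

Since 800 MB/s of disk bandwidth per OSS sits below the 1 GB/s point-to-point link, the network is indeed not the bottleneck in this scenario.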

Comparing HDFS to Lustre: Theoretical, Part II - Striping
- In terms of raw bandwidth, the network does not limit the data access rate
- By striping each Hadoop data block, we can focus our bandwidth on delivering a single block
- HDFS limit for any one node: 80 MB/s
- Lustre limit for any one node: 800 MB/s, assuming striping across 10 OSTs; we can deliver that to 10 nodes simultaneously
- A typical MapReduce workload is not simultaneous access (after the initial job kickoff)

MapReduce I/O Benchmark
[Benchmark chart: 8 nodes, QDR IB, 8 drives (80 MB/s each); HDFS: 8 nodes, 1 disk each; Lustre: 2-4 OST disks]

MR Sort Benchmark
[Benchmark chart] Hadoop data movement is limited to local disk and HTTP protocols

Lustre Advantages for Hadoop
Performance
- Caching file system with complete cache coherence
- High-performance file modifications; replication not required
- Uses high-speed DMA for data transfers
Scalability
- Support for billions of files (2.5 billion)
- All compute clients have access to the data
- Can leverage standard data and system availability techniques
Storage
- POSIX compliant
- No data transfer required for pre- and post-processing
- Reduces the need to manage multiple copies between analytic systems

ClusterStor 6000: A Big Data Scale-Out Solution
Delivering the ultimate in HPC data storage with:
- Optimized time to productivity: efficiency, application availability, results
- Unmatched file system performance, delivered: the industry's fastest just got two times faster
- Highest reliability, availability, and serviceability: enterprise-level resiliency

ClusterStor Solutions
An integrated and scalable HPC data storage solution designed to be easy to deploy, use, and manage, delivering efficiency, application availability, and massive results

Lustre Community and Xyratex: Roles in the Lustre Community
- OpenSFS & EOFS board member: direct funding of Lustre tree and roadmap development
- Active contributor to Lustre source and roadmap: world-class Lustre development team on staff
- Integration of Lustre into ClusterStor: industry-leading HPC storage solutions
- Lustre support services: ClusterStor, Lustre, and 3rd-party hardware

ClusterStor 6000: Optimized Time to Productivity
- Uses Xyratex's exclusive parallel scale-out file system processing and I/O architecture
- Leverages the latest Xyratex application platform technologies and Lustre integration
- Optimized HW/SW; fully integrated, factory tested, and shipped ready to go
- Results in increased file system throughput and capacity efficiency per rack unit

ClusterStor Delivers Scale-Out Lustre
[Architecture diagram: the Lustre topology with the ClusterStor Scalable Storage Unit (SSU) as the building block on the OSS/OST side (1-1,000s of OSSs) and a ClusterStor HA-MDS pair serving the MDT; Lustre clients (1-100,000) and CIFS/NFS clients via a gateway; supported networks: Gemini, Myrinet, IB, GigE]

ClusterStor 6000 Scale-Out Building Blocks
- Unmatched file system performance, delivered: the industry's fastest just got two times faster
- Each ClusterStor 6000 Scalable Storage Unit (SSU) produces 6 GB/s of file system performance
- Linear processing scalability supports installations of up to 1 TB/s of file system throughput and tens of PBs of storage capacity

ClusterStor Scalable Storage Unit (SSU)
[Diagram; source: Xyratex ClusterStor White Paper]

ClusterStor 6000
- The ClusterStor 6000 SSU produces 6.0 GB/s IOR, doubling SSU performance
- ClusterStor Embedded Server Module: two modules per SSU for high availability
- Increased performance: 42 GB/s per rack
- Latest processor technology, 2X memory, FDR InfiniBand

ClusterStor Family Performance and Capacity
More performance and storage capacity in less space
[Chart: user-level sustained IOR Lustre file system performance (90-360 GB/s) versus number of SSUs (30-150) and user-level storage capacity (5.76-28.80 PB); the ClusterStor 6000 doubles the per-SSU performance of the ClusterStor 3000]

ClusterStor 6000: Highest Reliability, Availability, and Serviceability
- Fully resilient software-hardware integration with low-level diagnostics, embedded monitoring, an enterprise-level data protection architecture, and proactive alerts
- Easy to manage, with real-time monitoring

ClusterStor: Powering the Fastest Storage System in the World (Q3 2012)
- >1 TB/s aggregate bandwidth
- Xyratex CS-6000 system
- Number of racks: 36
- Square footage: 644 ft²
- Hard drives: 17,280
- Power: ~0.443 MW
- Heat dissipation: 1,165,600 BTUs
- Exponentially less cost, space, cooling, and power than the competition!
Xyratex Confidential

Links
- Xyratex: http://www.xyratex.com/
- NCSA: http://www.ncsa.illinois.edu/
- Hadoop Demystified: http://blog.raremile.com/2012/06/hadoop-demystified/
- Wikibon on Big Data: http://wikibon.org/wiki/v/big_data and http://wikibon.org/blog/taming-big-data/

Thank You