Storage Architectures for Big Data in the Cloud



Similar documents
Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

CSE-E5430 Scalable Cloud Computing Lecture 2

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Enabling High performance Big Data platform with RDMA

Design and Evolution of the Apache Hadoop File System(HDFS)

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Hadoop Distributed File System Propagation Adapter for Nimbus

Hadoop & its Usage at Facebook

High Performance Computing OpenStack Options. September 22, 2015

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Accelerating and Simplifying Apache

PARALLELS CLOUD STORAGE

Filesystems Performance in GNU/Linux Multi-Disk Data Storage

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Apache Hadoop. Alexandru Costan

Getting performance & scalability on standard platforms, the Object vs Block storage debate. Copyright 2013 MPSTOR LTD. All rights reserved.

StorPool Distributed Storage Software Technical Overview

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division

Hadoop Distributed File System. Dhruba Borthakur June, 2007

(Scale Out NAS System)

GPFS Storage Server. Concepts and Setup in Lemanicus BG/Q system" Christian Clémençon (EPFL-DIT)" " 4 April 2013"

MaxDeploy Hyper- Converged Reference Architecture Solution Brief

Hadoop Architecture. Part 1

THE HADOOP DISTRIBUTED FILE SYSTEM

Apache Hadoop FileSystem and its Usage in Facebook

Implementing the Hadoop Distributed File System Protocol on OneFS Jeff Hughes EMC Isilon

GraySort and MinuteSort at Yahoo on Hadoop 0.23

IBM General Parallel File System (GPFS ) 3.5 File Placement Optimizer (FPO)

Hadoop & its Usage at Facebook

Chapter 7. Using Hadoop Cluster and MapReduce

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Building Storage as a Service with OpenStack. Greg Elkinbard Senior Technical Director

Distributed File Systems

Technology Insight Series

Big data management with IBM General Parallel File System

Network Attached Storage. Jinfeng Yang Oct/19/2015

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

The Hadoop Distributed File System

Hadoop and Map-Reduce. Swati Gore

Big Data With Hadoop

The Panasas Parallel Storage Cluster. Acknowledgement: Some of the material presented is under copyright by Panasas Inc.

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Building Storage Service in a Private Cloud

Distributed File Systems

Big Fast Data Hadoop acceleration with Flash. June 2013

Scala Storage Scale-Out Clustered Storage White Paper

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Hadoop Parallel Data Processing

A very short Intro to Hadoop

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

HADOOP MOCK TEST HADOOP MOCK TEST I

BlobSeer: Towards efficient data storage management on large-scale, distributed systems

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Introduction to Cloud Computing

Introduction to Gluster. Versions 3.0.x

Software Defined Microsoft. PRESENTATION TITLE GOES HERE Siddhartha Roy Cloud + Enterprise Division Microsoft Corporation

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research

HDFS Under the Hood. Sanjay Radia. Grid Computing, Hadoop Yahoo Inc.

Storage Virtualization from clusters to grid

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

RED HAT STORAGE SERVER TECHNICAL OVERVIEW

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Windows Server 2012 授 權 說 明

Red Hat Storage Server Administration Deep Dive

Hadoop IST 734 SS CHUNG

Testing of several distributed file-system (HadoopFS, CEPH and GlusterFS) for supporting the HEP experiments analisys. Giacinto DONVITO INFN-Bari

Taking Linux File and Storage Systems into the Future. Ric Wheeler Director Kernel File and Storage Team Red Hat, Incorporated

Panasas at the RCF. Fall 2005 Robert Petkus RHIC/USATLAS Computing Facility Brookhaven National Laboratory. Robert Petkus Panasas at the RCF

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Virtualizing Apache Hadoop. June, 2012

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

Cloud Optimize Your IT

Parallel Processing of cluster by Map Reduce

Product Brochure. Hedvig Distributed Storage Platform Modern Storage for Modern Business. Elastic. Accelerate data to value. Simple.

ebay Storage, From Good to Great

DSS. High performance storage pools for LHC. Data & Storage Services. Łukasz Janyst. on behalf of the CERN IT-DSS group

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

WHITE PAPER. Software Defined Storage Hydrates the Cloud

GlusterFS Distributed Replicated Parallel File System

Apache Hadoop FileSystem Internals

IBM ELASTIC STORAGE SEAN LEE

Open source Google-style large scale data analysis with Hadoop

Storage Solutions to Maximize Success in VDI Environments

Map Reduce / Hadoop / HDFS

HDFS Users Guide. Table of contents

Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage Division

CSE-E5430 Scalable Cloud Computing P Lecture 5

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

The Pros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen Storage Platforms. Abhijith Shenoy Engineer, Hedvig Inc.

<Insert Picture Here> Big Data

Transcription:

Storage Architectures for Big Data in the Cloud Sam Fineberg HP Storage CT Office/ May 2013

Overview Introduction What is big data? Big Data I/O Hadoop/HDFS SAN Distributed FS Cloud Summary Research Areas 2

What is big data? (for this talk) Applications that process large volumes of data (many terabytespetabytes) Highly parallel scale-out Scan oriented Regular inter-processor communications pattern Examples MapReduce Pattern matching Log analysis Unstructured data analysis 3

Big data I/O Common patterns Sequential scans Map/reduce Sorting Shuffle Index/graph traversal Disk vs. SSD Performance improvement depends on access method and sequentially of access Do frameworks like Hadoop even make sense for SSD? Does SSD make sense over Ethernet? For now, lets focus on disks disks, SSD is another talk 4

MapReduce in Hadoop Input (HDFS) Mapper Reducer Output (HDFS) 5 Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html

HDFS on local DAS Compute nodes are part of HDFS, data spread across nodes HDFS Protocol 6

Hadoop Distributed File System - HDFS Architecture Java application, not deeply integrated with the server OS Layered on top of a standard FS (e.g., ext, xfs, etc.) Must use Hadoop or a special library to access HDFS files Shared-nothing, all nodes have direct attached disks Write once filesystem must copy a file to modify it HDFS basics Data is organized into files & directories Files are divided into 64-128MB blocks, distributed across nodes Block placement is handled by the NameNode Placement coordinated with job tracker = writes always co-located, reads co-located with computation whenever possible Blocks replicated to handle failure, replica blocks can be used by compute tasks Checksums used to ensure data integrity Replication: one and only strategy for error handling, recovery and fault tolerance Self Healing Makes multiple copies (typically 3) 7

HDFS File Write Operation 3-way replication 8 Image source: Hadoop, The Definitive Guide Tom White, O Reil

HDFS File Read Operation 9 Image source: Hadoop, The Definitive Guide Tom White, O Reil

HDFS on local DAS - Pros and Cons Pros Writes are highly parallel Large files are broken into many parts, distributed across the cluster Three copies of any file block, one written local, two remote Not a simple round-robin scheme, tuned for Hadoop jobs Job tracker attempts to make reads local If possible, tasks scheduled in same node as the needed file segment Duplicate file segments are also readable, can be used for tasks too 10 Cons Not a replacement for general purpose storage Not a kernel-based POSIX filesystem Incompatible with standard applications and utilities (but future versions of Hadoop are adding more other application models) High replication cost compared with RAID/ shared disk The NameNode keeps track of data location SPOF - location data is critical and must be protected Scalability bottleneck (everything has to be in memory) Improvements to NameNode are in the works

Other options - HDFS with SAN Storage Storage Arrays iscsi or FC SAN Compute nodes Hadoop Cluster 11

Other options - HDFS with SAN Storage Storage Arrays iscsi or FC SAN Compute nodes Cache Hadoop Cluster Cache 12

HDFS File Write Operation Array can provide redundancy, no need to replicate data across data nodes Array Replicatio n 13

HDFS File Read Operation Array redundancy, means only a single source for data 14

SAN for Hadoop Storage Data is in one or more arrays attached to data nodes through a SAN Looks like local storage to data nodes Hadoop still utilizes HDFS Pros All the normal advantages of arrays RAID, centralized caching, thin provisioning, other advanced array features Centralized management, easy redistribution of storage Retains advantages of HDFS (as long as array is not over-utilized) Easy failover when compute node dies, can eliminate or reduce 3-way replication Cons Cost? It depends Unless if multiple arrays are used, scale is limited And with multiple arrays, management and cost advantages are reduced Still have HDFS complexity and manageability issues 15

Tightly coupled DFS for Hadoop General purpose shared file system Implemented in the kernel, single namespace, compatible with most applications (no special library or language) E.g., GPFS, Gluster, IBRIX, Data is distributed across local storage node disks Architecturally like HDFS Can utilize same disk options as HDFS Including shared nothing DAS SAN storage Some can also support shared SAN storage where raw volumes can be accessed by multiple nodes Failover model where only one node actively uses a volume, other can take over after failure Multiple initiator model where multiple nodes actively use a volume Shared nothing option has similar cost/performance to HDFS on DAS 16

Distributed FS local disks Compute nodes are part of the DFS, data spread across nodes Distributed FS inter-node Protocol 17

Distributed FS remote disks Scale out nodes are distributed FS servers Compute nodes are distributed FS clients Distributed FS inter-node Protocol 18

Tightly coupled DFS for Hadoop Pros Shared data access, any node can access any data like it is local POSIX compatible, works for non-hadoop apps just like a local file system Centralized management and administration No NameNode, may have a better block mapping mechanism Compute in-place, same copy can be served via NFS/CIFS Many of the performance benefits Cons HDFS is highly optimized for Hadoop, unlikely to get same optimization for a general purpose DFS Large file striping is not regular, based on compute distribution Copies are simultaneously readable Strict POSIX compliance leads to unnecessary serialization Hadoop assumes multiple-access to files, however, accesses are on block boundaries and don t overlap Need to relax POSIX compliance for large files, or just stick with many smaller files Some DFS s have scaling limitations that are worse than HDFS, not designed for thousands of nodes 19

What about the cloud? Comparison to Big Data Clusters* Similarities Scale-out, lots of compute instances Lots of storage, multiple options Highly networked Differences Cloud is Virtualized not designed for optimal per node performance Cloud is inherently Multi-tenant Cloud typically doesn t use the fastest networking technology Ethernet (not IB) VLANs, security, * Note that we are not including dedicated big data offerings in this analysis 20

Cloud Storage SAS/ SATA VM Host VM VM VM VM VM VM REST Object Storage VM Host VM VM VM VM VM VM 21

Cloud storage options for big data Amazon/OpenStack IaaS Storage Local disk 22 Files on VM host, accessed through LVM Not persistent, typically not very high performance Remote persistent volume storage EBS, Cinder Remote volumes, typically iscsi Varying implementation and performance Object storage REST based access, any application can access it Limited access modes sequential, writeonce Hybrid Options Dedicated cluster E.g., Amazon Elastic Mapreduce Similar to remote local DAS or remote DFS case Remote DFS Clients are VMs running in cloud Could be as simple as data on NFS Or, tightly coupled remote DFS with storage on dedicated nodes

Summary There are lots of storage architectures to choose from What is the predominant access mode? Read or write oriented Sequential or random Local or remote Need to consider entire workflow How does data feed into and out of the cluster Do multiple types of processing need to happen to data Will data be retained, where, how protected Manageability is important often trumps performance Will the job run on a dedicated cluster or the cloud? Virtualized or not? Dedicated service? Local or remote disks? Block or object storage? 23

Research areas What types of storage make the most sense for what workloads? SSD vs. disk, object vs. block Shared vs. dedicated storage Tightly or loosely coupled? POSIX or not? Caching Create some hard data Choices are often based on religion more than data hardware cost vs. TCO Consider entire chain, not just a micro benchmark Small improvements don t matter if you have to give up manageability/ease of use How does this all change in an NVM world? Does NVM address the entire supply chain, or just the data processing part? How can we take advantage of NVM without losing manageability? 24

Thank you