Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Similar documents

Storage Architectures for Big Data in the Cloud

A very short Intro to Hadoop

CSE-E5430 Scalable Cloud Computing Lecture 2

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

High Performance Computing OpenStack Options. September 22, 2015

Sujee Maniyam, ElephantScale

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

How it can benefit your enterprise. Dejan Kocic Hitachi Data Systems (HDS)

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Hadoop Architecture. Part 1

Chapter 7. Using Hadoop Cluster and MapReduce

Big Data With Hadoop

Scale and Availability Considerations for Cluster File Systems. David Noy, Symantec Corporation

How it can benefit your enterprise. Dejan Kocic Netapp

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Introduction to Analytics and Big Data - Hadoop. Rob Peglar EMC Isilon

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Data-Intensive Computing with Map-Reduce and Hadoop

Parallel Processing of cluster by Map Reduce

Open source Google-style large scale data analysis with Hadoop

Apache Hadoop. Alexandru Costan

Hadoop IST 734 SS CHUNG

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

ADVANCED DEDUPLICATION CONCEPTS. Larry Freeman, NetApp Inc Tom Pearce, Four-Colour IT Solutions

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Hadoop Parallel Data Processing

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

Apache Hadoop new way for the company to store and analyze big data

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Hadoop and Map-Reduce. Swati Gore

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

MapReduce. Tushar B. Kute,

Cloud and Big Data initiatives. Mark O Connell, EMC

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Accelerating and Simplifying Apache

Introduction to MapReduce and Hadoop

UNDERSTANDING DATA DEDUPLICATION. Tom Sas Hewlett-Packard

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Fault Tolerance in Hadoop for Work Migration

GraySort and MinuteSort at Yahoo on Hadoop 0.23

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Understanding Hadoop Performance on Lustre

UNDERSTANDING DATA DEDUPLICATION. Thomas Rivera SEPATON

Understanding Enterprise NAS

Hadoop Distributed File System Propagation Adapter for Nimbus

Trends in Application Recovery. Andreas Schwegmann, HP

Large scale processing using Hadoop. Ján Vaňo

HADOOP PERFORMANCE TUNING

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Hadoop implementation of MapReduce computational model. Ján Vaňo

Best Practice and Deployment of the Network for iscsi, NAS and DAS in the Data Center

Hadoop & its Usage at Facebook

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

I/O Considerations in Big Data Analytics

Accelerating Applications and File Systems with Solid State Storage. Jacob Farmer, Cambridge Computer

BBM467 Data Intensive ApplicaAons

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

DFS For Not-So Dummies. Matthew Geddes

UNDERSTANDING DATA DEDUPLICATION. Jiří Král, ředitel pro technický rozvoj STORYFLEX a.s.

Intro to Map/Reduce a.k.a. Hadoop

International Journal of Advance Research in Computer Science and Management Studies

Hadoop Distributed File System (HDFS) Overview

Implementing the Hadoop Distributed File System Protocol on OneFS Jeff Hughes EMC Isilon

Hadoop & its Usage at Facebook

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Enabling High performance Big Data platform with RDMA

Apache Hadoop FileSystem and its Usage in Facebook

VDI Optimization Real World Learnings. Russ Fellows, Evaluator Group

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Design and Evolution of the Apache Hadoop File System(HDFS)

Processing of Hadoop using Highly Available NameNode

Big Data - Infrastructure Considerations

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Hadoop Distributed File System. Dhruba Borthakur June, 2007

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

Distributed File Systems

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

CDH AND BUSINESS CONTINUITY:

Scala Storage Scale-Out Clustered Storage White Paper

Open source large scale distributed data management with Google s MapReduce and Bigtable

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Hadoop Scheduler w i t h Deadline Constraint

Apache Hadoop FileSystem Internals

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

Can Storage Fix Hadoop

THE HADOOP DISTRIBUTED FILE SYSTEM

NoSQL and Hadoop Technologies On Oracle Cloud

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

HDFS Under the Hood. Sanjay Radia. Grid Computing, Hadoop Yahoo Inc.

Transcription:

Sam Fineberg, HP Storage

SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations and literature under the following conditions: Any slide or slides used must be reproduced in their entirety without modification The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations. This presentation is a project of the SNIA Education Committee. Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney. The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK. 2

Abstract The Hadoop system was developed to enable the transformation and analysis of vast amounts of structured and unstructured information. It does this by implementing an algorithm called MapReduce across compute clusters that may consist of hundreds or even thousands of nodes. In this presentation Hadoop will be looked at from a storage perspective. The tutorial will describe the key aspects of Hadoop storage, the built-in Hadoop file system (HDFS), and some other options for Hadoop storage that exist in the commercial and open source communities. 3

What is Hadoop? A scalable fault-tolerant distributed system for data storage and processing Core Hadoop has two main components MapReduce: fault-tolerant distributed processing Programming model for processing sets of data Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s) Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage Reliable, redundant, distributed file system optimized for large files Operates on unstructured and structured data A large and active ecosystem Written in Java Open source under the friendly Apache License http://wiki.apache.org/hadoop/ 4

What is MapReduce? A method for distributing a task across multiple nodes Each node processes data stored on that node Consists of two developer-created phases 1. Map 2. Reduce In between Map and Reduce is the Shuffle and Sort 5

MapReduce Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html 6

MapReduce Operation What was the max temperature for the last century? 7

Key MapReduce Terminology Concepts A user runs a client program (typically a Java application) on a client computer The client program submits a job to Hadoop The job is sent to the JobTracker process on the Master Node Each Slave Node runs a process called the TaskTracker The JobTracker instructs TaskTrackers to run and monitor tasks A task attempt is an instance of a task running on a slave node There will be at least as many task attempts as there are tasks which need to be performed 8

MapReduce in Hadoop Input (HDFS) Mapper Task Tracker Reducer Worker=Tasks Output (HDFS) Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html 9

MapReduce: Basic Concepts Each Mapper processes single input split from HDFS Hadoop passes one record at a time to the developer s Map code Each record has a key and a value Intermediate data written by the Mapper to local disk During shuffle and sort phase, all values associated with same intermediate key are transferred to same Reducer Reducer is passed each key and a list of all its values Output from Reducers is written to HDFS 10

What is a Distributed File System? A distributed file system is a file system that allows access to files from multiple hosts across a network A network filesystem (NFS/CIFS) is a type of distributed file system more tuned for file sharing than distributed computation Distributed computing applications, like Hadoop, utilize a tightly coupled distributed file system Tightly coupled distributed filesystems Provide a single global namespace across all nodes Support multiple initiators, multiple disk nodes, multiple access to files file parallelism Examples include HDFS, pnfs, as well as many commercial and research systems 11

Hadoop Distributed File System - HDFS Architecture Java application, not deeply integrated with the server OS Layered on top of a standard FS Must use Hadoop or a special library to access HDFS files Shared-nothing, all nodes have direct attached disks Write once filesystem must copy a file to modify it HDFS basics Data is organized into files & directories Files are divided into 64-128MB blocks, distributed across nodes Block placement is handled by the NameNode Placement coordinated with job tracker = writes always co-located, reads co-located with computation whenever possible Blocks replicated to handle failure, replica blocks can be used by compute tasks Checksums used to ensure data integrity Replication: one and only strategy for error handling, recovery and fault tolerance Self Healing Makes multiple copies (typically 3) 12

HDFS on local DAS A Hadoop cluster consisting of many nodes, each of which has local direct attached storage (DAS) Disks are running a file system HDFS blocks are stored as files in a special directory Disks attached directly, for example, with SAS or SATA No storage is shared, disks only attach to a single node The most common use case for Hadoop Original design point for Hadoop/HDFS Can work with cheap unreliable hardware Some very large systems utilize this model 13

HDFS on local DAS Compute nodes are part of HDFS, data spread across nodes HDFS Protocol 14

HDFS File Write Operation Image source: Hadoop, The Definitive Guide Tom White, O Reilly 15

HDFS File Read Operation Image source: Hadoop, The Definitive Guide Tom White, O Reilly 16

HDFS on local DAS - Pros and Cons Pros Writes are highly parallel Large files are broken into many parts, distributed across the cluster Three copies of any file block, one written local, two remote Not a simple round-robin scheme, tuned for Hadoop jobs Job tracker attempts to make reads local If possible, tasks scheduled in same node as the needed file segment Duplicate file segments are also readable, can be used for tasks too Cons Not a kernel-based POSIX filesystem incompatible with standard applications and utilities High replication cost compared with RAID/shared disk The NameNode keeps track of data location SPOF (for now), location data is critical and must be protected Scalability bottleneck (everything has to be in memory) 17

Other HDFS storage options HDFS on Storage Area Network (SAN) attached storage A lot like DAS, but disks are logical volumes in storage array(s), accessed across a SAN HDFS doesn t know the difference SAN attached arrays aren t the same as DAS Array has its own cache, redundancy, replication, etc. Any node on the SAN can access any array volume So a new node can be assigned to a failed node s data 18

HDFS with SAN Storage Storage Arrays iscsi or FC SAN Compute nodes Hadoop Cluster 19

HDFS File Write Operation Array can provide redundancy, no need to replicate data across data nodes Array Replication 20

HDFS File Read Operation Array redundancy, means only a single source for data 21

SAN for Hadoop Storage Instead of storing data on direct attached local disks, data is in one or more arrays attached to data nodes through a SAN Looks like local storage to data nodes Hadoop still utilizes HDFS Pros All the normal advantages of arrays RAID, centralized caching, thin provisioning, other advanced array features Centralized management, easy redistribution of storage Retains advantages of HDFS (as long as array is not over-utilized) Easy failover when compute node dies, can eliminate or reduce 3-way replication Cons Cost? It depends Unless if multiple arrays are used, scale is limited And with multiple arrays, management and cost advantages are reduced Still have HDFS complexity and manageability issues 22

Other distributed filesystems Kernel-based tightly coupled distributed file system Kernel-based, i.e., no special access libraries, looks like a normal local file system These filesystems have existed for years in high performance computing, scale-out NAS servers, and other scale-out computing environments Many commercial and research examples Not originally designed for Hadoop like HDFS Location awareness is part of the file system no NameNode Works better if functionality is exposed to Hadoop Compute nodes may or may not have local storage Compute nodes are part of the storage cluster, but may be diskless i.e., equal access to files and global namespace Can tie the filesystem s location awareness into task tracker to reduce remote storage access Remote storage is accessed using a filesystem specific inter-node protocol Single network hop due to filesystem s location awareness 23

Distributed FS local disks Compute nodes are part of the DFS, data spread across nodes Distributed FS inter-node Protocol 24

Distributed FS remote disks Scale out nodes are distributed FS servers Compute nodes are distributed FS clients Distributed FS inter-node Protocol 25

Remote DFS Write Operation Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS 26

Local DFS Write Operation Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS 27

Remote DFS Read Operation Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS 28

Local DFS Read Operation Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS 29

Tightly coupled DFS for Hadoop General purpose shared file system Implemented in the kernel, single namespace, compatible with most applications (no special library or language) Data is distributed across local storage node disks Architecturally like HDFS Can utilize same disk options as HDFS Including shared nothing DAS SAN storage Some can also support shared SAN storage where raw volumes can be accessed by multiple nodes Failover model where only one node actively uses a volume, other can take over after failure Multiple initiator model where multiple nodes actively use a volume Shared nothing option has similar cost/performance to HDFS on DAS 30

Tightly coupled DFS for Hadoop Pros Shared data access, any node can access any data like it is local POSIX compatible, works for non-hadoop apps just like a local file system Centralized management and administration No NameNode, may have a better block mapping mechanism Compute in-place, same copy can be served via NFS/CIFS Many of the performance benefits Cons HDFS is highly optimized for Hadoop, unlikely to get same optimization for a general purpose DFS Large file striping is not regular, based on compute distribution Copies are simultaneously readable Strict POSIX compliance leads to unnecessary serialization Hadoop assumes multiple-access to files, however, accesses are on block boundaries and don t overlap Need to relax POSIX compliance for large files, or just stick with many smaller files Some DFS s have scaling limitations that are worse than HDFS, not designed for thousands of nodes 31

Summary Hadoop provides a scalable fault-tolerant environment for analyzing unstructured and structured information The default way to store data for Hadoop is HDFS on local direct attached disks Alternatives to this architecture include SAN array storage and tightly-coupled general purpose DFS They can provide some significant advantages However, they aren t without their downsides, its hard to beat a filesystem designed specifically for Hadoop Which one is best for you? Depends on what is most important cost, manageability, compatibility with existing infrastructure, performance, scale, 32

Refer to Other Tutorials Use this icon to refer to other SNIA Tutorials where appropriate Check out SNIA Tutorial: Enter Tutorial Title Here Use the Hands-On Lab icon only if you ve been notified that your tutorial is a cross-referenced match to the Hands-On Lab 33

Attribution & Feedback The SNIA Education Committee would like to thank the following individuals for their contributions to this Tutorial. Authorship History Sam Fineberg, August 2012 Additional Contributors Rob Peglar Updates: Please send any questions or comments regarding this SNIA Tutorial to tracktutorials@snia.org 34