HFAA: A Generic Socket API for Hadoop File Systems

Similar documents
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Hadoop Architecture. Part 1

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Apache Hadoop. Alexandru Costan

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

THE HADOOP DISTRIBUTED FILE SYSTEM

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Apache Hadoop FileSystem Internals

Chapter 7. Using Hadoop Cluster and MapReduce

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

A very short Intro to Hadoop

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

Storage Architectures for Big Data in the Cloud

Accelerating and Simplifying Apache

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Apache Hadoop new way for the company to store and analyze big data

Snapshots in Hadoop Distributed File System

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Hadoop & its Usage at Facebook

Apache HBase. Crazy dances on the elephant back

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Performance Analysis of Mixed Distributed Filesystem Workloads

Hadoop. Sunday, November 25, 12

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud

Mixing Hadoop and HPC Workloads on Parallel Filesystems

HDFS Under the Hood. Sanjay Radia. Grid Computing, Hadoop Yahoo Inc.

MapReduce, Hadoop and Amazon AWS

Storage Virtualization in Cloud

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

L1: Introduction to Hadoop

Hadoop & its Usage at Facebook

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER

Big Data Technology Core Hadoop: HDFS-YARN Internals

BIG DATA TECHNOLOGY. Hadoop Ecosystem

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

Big Data Analytics. Lucas Rego Drumond

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

CLOUD COMPUTING USING HADOOP TECHNOLOGY

Optimize the execution of local physics analysis workflows using Hadoop

NoSQL and Hadoop Technologies On Oracle Cloud

<Insert Picture Here> Oracle and/or Hadoop And what you need to know

Hadoop implementation of MapReduce computational model. Ján Vaňo

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Apache Hadoop FileSystem and its Usage in Facebook

Lustre * Filesystem for Cloud and Hadoop *

Hadoop Parallel Data Processing

Open source Google-style large scale data analysis with Hadoop

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Big Data Analytics - Accelerated. stream-horizon.com

Intro to Map/Reduce a.k.a. Hadoop

Hadoop Distributed File System (HDFS) Overview

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Big Data Trends and HDFS Evolution

Matt Benton, Mike Bull, Josh Fyne, Georgie Mackender, Richard Meal, Louise Millard, Bogdan Paunescu, Chencheng Zhang

International Journal of Advance Research in Computer Science and Management Studies

Distributed File Systems An Overview. Nürnberg, Dr. Christian Boehme, GWDG

A Brief Outline on Bigdata Hadoop

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Data-Intensive Computing with Map-Reduce and Hadoop

Using Hadoop for Webscale Computing. Ajay Anand Yahoo! Usenix 2008

Reduction of Data at Namenode in HDFS using harballing Technique

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Bright Cluster Manager

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Hadoop and Map-Reduce. Swati Gore

Large scale processing using Hadoop. Ján Vaňo

HADOOP MOCK TEST HADOOP MOCK TEST I

Application Development. A Paradigm Shift

Gluster Filesystem 3.3 Beta 2 Hadoop Compatible Storage

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Big Data With Hadoop

Hadoop Scheduler w i t h Deadline Constraint

How To Use Hadoop

Performance and Energy Efficiency of. Hadoop deployment models

Introduction to Cloud Computing

Transcription:

Second Workshop on Architectures and Systems for Big Data (ASBD 2012) June 9 th, 2012 HFAA: A Generic Socket API for Hadoop File Systems Adam Yee and Jeffrey Shafer University of the Pacific

2 Hadoop MapReduce Hadoop: Open source framework for data- intensive compu<ng Inspired by Google s web indexing framework Uses MapReduce parallel programming model Enables scalable computa<on on a commodity cluster computer Popular and in widespread use today Amazon, Facebook, MicrosoL Bing, Yahoo,

3 Hadoop Software Stack Hadoop is an all- in- one so?ware framework that <es the cluster together ComputaDon Execute Map and Reduce tasks Storage User- level filesystem for applica<ons HDFS Hadoop Distributed File System Scheduling Distribute jobs across cluster Reliability Data replica<on, re- spawning failed jobs Designed for portability (WriXen in Java)

4 Motivating Challenge 1. I already have a cluster computer 2. I already have a different distributed file system running (and exper<se in managing it) File system examples: PVFS, Ceph, Lustre, GPFS Will refer to them collec<vely as NewDFS for the remainder of talk 3. Hadoop (MapReduce) is only a small part of my computa<on workload Hadoop needs to come to me, and not the other way around

5 Motivating Challenge This is harder than it sounds! Hadoop is Dghtly integrated with HDFS (Hadoop Distributed File System) Holds data input and computa<on output How can I use Hadoop to process data stored in other distributed file systems?

6 Use Hadoop with NewDFS Method 1: Copy the data from NewDFS to HDFS Pros: Easy J Cons: Slow and wastes storage space Method 2: Mount NewDFS using a POSIX driver (which can be directly accessed in Hadoop) Pros: Easy; Faster than making a copy first! (but s<ll slow) Cons: Lose Hadoop performance op<miza<ons (like data locality) Method 3: Custom solware layer integrates directly with Hadoop Pros: Near- na<ve speed Cons: Highest complexity; Requires detailed knowledge of Hadoop and NewDFS architecture

7 Using Hadoop with NewDFS Past Projects Hadoop with CloudStore Hadoop with Ceph Hadoop with GPFS Hadoop with Lustre Hadoop with PVFS So is this a solved problem? No! Limita<ons of Prior Work All of these are point solu<ons Different implementa<on strategies Each require different solware patches to Hadoop

8 Hadoop Filesystem Agnostic API (HFAA) Universal, generic interface Allows Hadoop to run on any file system that supports network sockets Since we re targebng distributed file systems, that should include everyone Design moves integra<on responsibili<es outside of Hadoop Does not require user or developer knowledge of the Hadoop framework

9 Traditional Hadoop FileSystem is Hadoop s storage API class HDFS Local disk Amazon S3 Java implementa<on Cluster Node Hadoop MapReduce Applica<on(s) TaskTracker FileSystem + HDFS Client Java Run<me Opera<ng System (Linux) HDFS DataNode HDFS Daemon Disk Disk

10 Hadoop + HFAA Client: Integrates with Hadoop (Java) Reusable for any NewDFS Hadoop MapReduce Applica<on(s) TaskTracker FileSystem + HFAA Client NewDFS HFAA Server Cluster Node Server: Integrates with NewDFS (Any language) Java Run<me Opera<ng System (Linux) NewDFS Daemon Disk Disk

11 HFAA Operations Fundamental Hadoop storage opera<ons How do we know API is complete? Every class that extends FileSystem (including HFAA) must implement these abstract methods FuncDon Open Create Append Rename Delete List Status Make Directories Get File State Write Read

12 Evaluation System HFAA prototype wrixen for Hadoop + PVFS Hadoop 0.20.204.0 Released Sept 5 th, 2011 OrangeFS 2.8.4 Branch of PVFS Run on small 4- node cluster Streaming reads and writes Architectures Compared 1. Raw disk (outside Hadoop) 2. Hadoop with na<ve HDFS 3. Hadoop with OrangeFS using POSIX driver 4. Hadoop with OrangeFS using HFAA

13 Evaluation Bandwidth (MB/s) 140 120 100 80 60 40 20 0 Streaming Write Raw Disk Hadoop+OFS Hadoop+HDFS Streaming Read Hadoop+HFAA+OFS

14 Next Steps Performance Op<miza<ons Expose file system locality to Hadoop scheduler More important when we test on larger clusters Socket re- use between requests Less detrimental because HDFS moves 64MB data block per request Tune, tune, tune! Broader Compa<bility Implement HFAA Server component for other popular file systems Lustre? Ceph? Others? Release for Hadoop 1.0.x family

15 Summary Hadoop Filesystem AgnosDc API (HFAA) Generic interface supports any distributed file system Implementa<on includes two components HFAA Client Interfaces with Hadoop HFAA Server Interfaces with PVFS Client can be re- used with all future filesystems Have a new filesystem you like? You only need to understand your filesystem and our simple API to link it to Hadoop You don t have to be a Hadoop expert

16 Questions????