Sector vs. Hadoop. A Brief Comparison Between the Two Systems




Background
Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know the differences. Is Sector simply another clone of GFS/MapReduce? No. These slides compare the most recent versions of Sector and Hadoop as of Nov. 2010. Both systems are still under active development. If you find any inaccurate information, please contact Yunhong Gu [yunhong.gu # gmail]. We will try to keep these slides accurate and up to date.

Design Goals
Sector: three-layer functionality:
- Distributed file system
- Data collection, sharing, and distribution (over wide area networks)
- Massive in-storage parallel data processing with a simplified interface
Hadoop: two-layer functionality:
- Distributed file system
- Massive in-storage parallel data processing with a simplified interface

History
Sector: started around 2005-2006 as a P2P file sharing and content distribution system for extremely large scientific datasets. Switched to a centralized, general-purpose file system in 2007. Introduced in-storage parallel data processing in 2007. First modern version released in 2009.
Hadoop: started as a web crawler and indexer, Nutch, which adopted GFS/MapReduce between 2004 and 2006. Yahoo! took over the project in 2006, and Hadoop split from Nutch. First modern version released in 2008.

Architecture
Sector: master-slave system. Masters store metadata, slaves store data. Multiple active masters. Clients perform IO directly with the slaves.
Hadoop: master-slave system. The namenode stores metadata, datanodes store data. Single namenode (a single point of failure). Clients perform IO directly with the datanodes.

Distributed File System
Sector: general-purpose IO. Optimized for large files. File based (files are not split by Sector; users must split very large files themselves). Uses replication for fault tolerance.
HDFS: write once, read many (no random writes). Optimized for large files. Block based (64 MB blocks by default). Uses replication for fault tolerance.

Replication
Sector: system-level default set in the configuration. A per-file replica factor can be specified in a configuration file and changed at run time. Replicas are stored as far away from each other as possible, but within a distance limit that is configurable per file. File location can be restricted to certain clusters only.
HDFS: system-level default set in the configuration. A per-file replica factor is supported at file creation. With 3 replicas (the default), 2 are placed on the same rack and the 3rd on a different rack.
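For concreteness, the HDFS side maps onto standard configuration: the system-level default is the dfs.replication property in hdfs-site.xml (the value below is illustrative, not a recommendation):

```xml
<!-- hdfs-site.xml: system-level default replica factor -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

A per-file factor can also be changed after creation from the shell, e.g. `hadoop fs -setrep -w 2 /path/to/file`, where `-w` waits until re-replication completes.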

Security
Sector: a Sector security server maintains user credentials and permissions, server ACLs, etc. The security server can be extended to connect to other sources, e.g., LDAP. Optional file transfer encryption. UDP-based hole punching for firewall traversal by clients.
Hadoop: security is still in active development, with new features in 2010. Kerberos/token-based security framework to authenticate users. No file transfer encryption.

Wide Area Data Access
Sector: ensures high-performance data transfer with UDT, a high-speed data transfer protocol. Because Sector pushes replicas as far away from each other as possible, a remote Sector client may find a nearby replica. Thus, Sector can be used as a content distribution network for very large datasets.
HDFS: no special consideration for wide area access. Its performance for remote access would be close to that of a stock FTP server. Its security mechanism may also be a problem for remote data access.

In-Storage Data Processing
Sphere: applies arbitrary user-defined functions (UDFs) to data segments. UDFs can be Map, Reduce, or anything else. Native MapReduce is supported as well.
Hadoop MapReduce: supports the MapReduce framework.

Sphere UDF vs. Hadoop MapReduce
Sphere UDF: parsing via a permanent record offset index, if necessary. Data segments (records, blocks, files, and directories) are processed by UDFs. Transparent load balancing and fault tolerance. Sphere is about 2-4x faster in various benchmarks.
Hadoop MapReduce: run-time data parsing with a default or user-defined parser. Data records are processed by Map and Reduce operations. Transparent load balancing and fault tolerance.

Why is Sphere Faster than Hadoop?
C++ vs. Java. Different internal data flows. The Sphere UDF model is more flexible than MapReduce: a UDF disassembles MapReduce and gives developers more control over the process. Sphere has better input data locality: better performance for applications that process files, or groups of files, as the minimum input unit. Sphere has better output data locality: the output location can be used to optimize iterative and combinative processing, such as joins. Different implementations and optimizations: UDT vs. TCP (significant in wide area systems).

Compatibility with Existing Systems
Sector/Sphere: Sector files can be accessed from outside the system if necessary. Sphere can simply apply an existing application executable to multiple files, or even directories, in parallel, as long as the executable accepts a file or a directory as input.
Hadoop: data in HDFS can only be accessed via HDFS interfaces. In Hadoop, executables that process files may also be wrapped in Map or Reduce operations, but this causes extra data movement when the file size is greater than the block size. Hadoop cannot process multiple files within one operation.

Conclusions
Sector is a unique system that integrates a distributed file system, a content sharing network, and a parallel data processing framework. Hadoop is mainly focused on large data processing within a single data center. They overlap in their parallel data processing support.

Our Recommendations
Consider using Sector if:
- You need a scalable, fault-tolerant, general-purpose file system
- You have data across multiple data centers
- You have users who upload and download data from remote locations
- You are a C++ programmer
- You have many legacy applications and you don't want to rewrite them to suit a new platform
- You value the fact that Sphere is about 2-4 times faster than Hadoop

Our Recommendations (cont.)
Consider using Hadoop if:
- You are a Java programmer
- It is important for you to have data semantics support such as HBase or Hive
- You can benefit from reusing existing packages from the larger Hadoop community