Enabling Multi-pipeline Data Transfer in HDFS for Big Data Applications



Enabling Multi-pipeline Data Transfer in HDFS for Big Data Applications Liqiang (Eric) Wang, Hong Zhang University of Wyoming Hai Huang IBM T.J. Watson Research Center

Background Apache Hadoop is a popular open-source implementation of the MapReduce programming model for handling large data sets. Hadoop has two main components: MapReduce and the Hadoop Distributed File System (HDFS). Our research focuses on optimizing the performance of uploading data to HDFS.

Motivation HDFS is inefficient when uploading data files from the client's local file system due to its synchronous pipeline design. The original HDFS transfers data blocks one by one, waiting for ACK (acknowledgement) packets from all datanodes involved in the transmission.
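The stop-and-wait behavior described above can be sketched as a simplified cost simulation (not actual HDFS code; the function and its parameters are illustrative):

```python
def upload_block_stop_and_wait(packets, datanodes, hop_time=1):
    """Simulate the synchronous pipeline: each packet is forwarded
    datanode-to-datanode, and the client blocks until the ACK
    propagates back from the tail before sending the next packet."""
    elapsed = 0
    for _packet in packets:
        elapsed += hop_time * len(datanodes)  # forward hops
        elapsed += hop_time * len(datanodes)  # ACK hops back to client
    return elapsed

# 100 packets through a 3-node pipeline: every hop is serialized,
# so total time grows with packets * pipeline length.
print(upload_block_stop_and_wait(range(100), ["dn1", "dn2", "dn3"]))
```

The point of the sketch is that per-packet latency is paid serially; no packet overlaps with the ACK wait of the previous one.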

Objectives To introduce an innovative asynchronous data transmission approach that greatly improves the performance of HDFS write operations. To support flexible sorting of datanodes in pipelines based on real-time and historical datanode access conditions. To provide a comprehensive fault tolerance mechanism for this asynchronous transmission approach.

Optimization of Big Data Systems [ICPP 2014] Hadoop: open-source implementation of MapReduce, with two main components: MapReduce and the Hadoop Distributed File System (HDFS). Performance issue with HDFS: an HDFS write is much slower than SCP.

Problem Scrutinizing Finding the problem is not easy: no detailed documentation is available, so we had to read the source code, then profile and test. The reason for the slow upload performance is the data block transmission mechanism: a synchronous, pipelined stop-and-wait protocol (Client → Datanode1 → Datanode2 → Datanode3).

Our Own Approach Asynchronous Multi-pipelined Data Transfer
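A minimal sketch of the asynchronous multi-pipeline idea, assuming a hypothetical `Pipeline.transfer` API and round-robin dispatch (the real SMARTH implementation streams packets inside HDFS; this only illustrates that blocks move concurrently through independent pipelines):

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor

class Pipeline:
    """Toy pipeline of datanodes; transfer() stands in for the
    real block-streaming logic."""
    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.lock = threading.Lock()  # one block at a time per pipeline

    def transfer(self, block):
        with self.lock:
            return (block, tuple(self.datanodes))

def upload_blocks_async(blocks, pipelines):
    """Dispatch blocks round-robin to pipelines; the client does not
    wait for one block's ACKs before starting the next block."""
    cycle = itertools.cycle(pipelines)
    with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
        futures = [pool.submit(next(cycle).transfer, b) for b in blocks]
        return [f.result() for f in futures]

results = upload_blocks_async(
    ["b0", "b1", "b2", "b3"],
    [Pipeline(["dn1", "dn2", "dn3"]), Pipeline(["dn4", "dn5", "dn6"])])
print(len(results))  # 4
```

With two pipelines, blocks b0/b2 and b1/b3 can be in flight simultaneously, which is the source of the speedup over the single stop-and-wait pipeline.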

Optimization for Data Transmission Select a datanode randomly from the n best-performing nodes for this client as the first datanode. Randomly choose datanodes for the replicas.
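The selection rule can be sketched as follows (a simplified illustration; the ranking is assumed to come from the measured transfer speeds mentioned in the talk, and the replica count of 2 follows HDFS's default of 3 total copies):

```python
import random

def pick_first_datanode(ranked_datanodes, n, extra_replicas=2):
    """Choose the first pipeline node at random from the n
    best-performing datanodes for this client; the remaining
    replicas go to randomly chosen nodes from the rest."""
    first = random.choice(ranked_datanodes[:n])
    rest = [d for d in ranked_datanodes if d != first]
    replicas = random.sample(rest, extra_replicas)
    return first, replicas

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5", "dn6"]  # fastest first
first, replicas = pick_first_datanode(nodes, n=3)
print(first in nodes[:3], len(replicas))  # True 2
```

Randomizing within the top n (instead of always taking the single fastest node) spreads load across the well-performing datanodes.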

Datanode Exchange Decide whether to swap the first datanode with another datanode in the pipeline, in order to give every node a chance to lead.
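One way to sketch the exchange decision (the threshold and the speed bookkeeping are illustrative assumptions, not the paper's exact criterion):

```python
def maybe_exchange(pipeline, speeds, threshold=1.5):
    """If another node in the pipeline has been substantially faster
    than the current head, swap it to the front so well-performing
    nodes get a chance to lead the pipeline."""
    head = pipeline[0]
    best = max(pipeline, key=lambda d: speeds[d])
    if best != head and speeds[best] > threshold * speeds[head]:
        i = pipeline.index(best)
        pipeline[0], pipeline[i] = pipeline[i], pipeline[0]
    return pipeline

print(maybe_exchange(["dn1", "dn2", "dn3"],
                     {"dn1": 10.0, "dn2": 40.0, "dn3": 12.0}))
# → ['dn2', 'dn1', 'dn3']
```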

Fault Tolerance for HDFS Check the validity of parameters and close all streams related to the block; move all packets in the ACK queue back to the data queue. Pick the primary datanode from the active datanodes in the pipeline, and use it to recover the other datanodes.
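The requeue-and-recover steps described above can be sketched like this (simplified; real HDFS recovery also closes streams and bumps the block's generation stamp, which is omitted here):

```python
from collections import deque

def recover_pipeline(data_queue, ack_queue, datanodes, failed):
    """Requeue unacknowledged packets, drop failed datanodes, and
    pick a primary from the survivors to drive recovery."""
    # Move every packet still awaiting an ACK back onto the front of
    # the data queue, preserving order, so it is resent through the
    # repaired pipeline.
    while ack_queue:
        data_queue.appendleft(ack_queue.pop())
    active = [d for d in datanodes if d not in failed]
    primary = active[0]  # primary datanode recovers the others
    return data_queue, active, primary

dq, active, primary = recover_pipeline(
    deque(["p3", "p4"]), deque(["p1", "p2"]),
    ["dn1", "dn2", "dn3"], failed={"dn2"})
print(list(dq), active, primary)
# → ['p1', 'p2', 'p3', 'p4'] ['dn1', 'dn3'] dn1
```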

Fault Tolerance for SMARTH Stop the current block transmission, then start a recovery process (Alg. 3) to recover the error pipelines in the error pipeline set.

Buffer Overflow Problem Two conditions: limit the number of pipelines to a maximum (the cluster size / the number of replicas), and if a datanode is already in a pipeline, it cannot be added to another pipeline created by the same client. Result: each datanode belongs to only one pipeline.
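A sketch of these two admission rules (illustrative only; the real placement also uses the performance-based ordering from the previous slides):

```python
def build_pipelines(datanodes, replicas):
    """Cap the number of pipelines at cluster_size / replicas, and
    never place a datanode in more than one pipeline for the same
    client, so each datanode buffers packets for only one pipeline."""
    max_pipelines = len(datanodes) // replicas
    assigned = set()
    pipelines = []
    candidates = iter(datanodes)
    for _ in range(max_pipelines):
        pipe = []
        for d in candidates:
            if d not in assigned:       # second rule: one pipeline per node
                assigned.add(d)
                pipe.append(d)
            if len(pipe) == replicas:
                break
        pipelines.append(pipe)
    return pipelines

pipes = build_pipelines(["dn%d" % i for i in range(9)], replicas=3)
print(len(pipes), all(len(p) == 3 for p in pipes))  # 3 True
```

With 9 datanodes and 3 replicas, at most 3 pipelines exist and the pipelines are disjoint, so no datanode's buffer is shared between pipelines.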

Data Imbalance Problem Conditions: always choose a random datanode from the top N as the first datanode (N = the cluster size / the number of replicas), and select the other datanodes from the remaining active datanodes. Hence there is no imbalance problem.

Experiments Setup We use four different clusters in our evaluation. Three of the clusters are homogeneous, each consisting of one namenode and nine datanodes (of small, medium, or large instances, respectively). The other cluster is heterogeneous, consisting of 3 small, 4 medium, and 3 large instance nodes.

Experiments Two-Rack Cluster

Experiments Bandwidth Contention

Experiments Heterogeneous Cluster Without any network throttling, Figure 13 shows that uploading an 8 GB file takes 289 seconds with stock HDFS, while SMARTH takes only 205 seconds, i.e., SMARTH is 41% faster.

Conclusion We introduce an asynchronous multi-pipeline file transfer protocol with a revised fault tolerance mechanism, replacing HDFS's default stop-and-wait single-pipeline protocol. We employ global and local optimization techniques to sort datanodes in pipelines based on historical data transfer speed.

SPLSQR: Optimizing HPC Performance and Scalability (ICCS 2012, 2013; collaborative project with Dr. John Dennis, NCAR) A Scalable Parallel LSQR (SPLSQR) algorithm for solving large-scale linear systems in seismic tomography and reservoir simulation. Main idea: optimize the partition strategy to significantly reduce communication cost and improve overall performance. Much faster (17-74x) than the widely used PETSc. Reported by NCSA magazine and NSF (http://www.nsf.gov/news/news_summ.jsp?cntn_id=128020&org=nsf&from=news)

GPU Acceleration (TeraGrid11, HPCS12, IEEE TPDS) Based on our accurate GPU performance modeling, we optimize the performance of SpMV (Sparse Matrix-Vector Multiplication). Main idea: a sparse matrix can be partitioned into blocks with optimal storage formats, which dramatically affect performance. (Left) The accuracy of the performance model is around 9%. (Right) Performance improvements are around 41%, 50%, and 38%, respectively.
