Load Rebalancing for Distributed File Systems in Clouds

Smita Salunkhe, S. S. Sannakki
Department of Computer Science and Engineering, KLS Gogte Institute of Technology, Belgaum, Karnataka, India
Affiliated to Visvesvaraya Technological University
smitajs21@gmail.com, sannakkisanjeev@git.edu

ABSTRACT

Distributed file systems are key building blocks for cloud computing applications based on the MapReduce programming paradigm. In such file systems, nodes simultaneously serve computing and storage functions; a file is partitioned into a number of chunks allocated in distinct nodes so that MapReduce tasks can be performed in parallel over the nodes. However, in a cloud computing environment, failure is the norm, and nodes may be upgraded, replaced, and added in the system. Files can also be dynamically created, deleted, and appended. This results in load imbalance in a distributed file system; that is, the file chunks are not distributed as uniformly as possible among the nodes. Emerging distributed file systems in production systems strongly depend on a central node for chunk reallocation. This dependence is clearly inadequate in a large-scale, failure-prone environment because the central load balancer is put under considerable workload that is linearly scaled with the system size, and may thus become the performance bottleneck and the single point of failure. In this paper, a fully distributed load rebalancing algorithm is presented to cope with the load imbalance problem. Our algorithm is compared against a centralized approach in a production system and a competing distributed solution presented in the literature. The simulation results indicate that our proposal is comparable with the existing centralized approach and considerably outperforms the prior distributed algorithm in terms of load imbalance factor, movement cost, and algorithmic overhead.
The performance of our proposal implemented in the Hadoop distributed file system is further investigated in a cluster environment.

Index Terms: Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

2015, IJAFRC and NCRTIT 2015 All Rights Reserved www.ijafrc.org

I. INTRODUCTION

Cloud computing is a compelling technology. In clouds, clients can dynamically allocate their resources on demand without sophisticated deployment and management of resources. Key enabling technologies for clouds include the MapReduce programming paradigm, distributed file systems, virtualization, and so forth. These techniques emphasize scalability, so clouds can be large in scale, and comprising entities can arbitrarily fail and join while maintaining system reliability.

Distributed file systems are key building blocks for cloud computing applications based on the MapReduce programming paradigm. In such file systems, nodes simultaneously serve computing and storage functions; a file is partitioned into a number of chunks allocated in distinct nodes so that MapReduce tasks can be performed in parallel over the nodes. For example, consider a word count application that counts the number of distinct words and the frequency of each unique word in a large file. In such an application, a cloud partitions the file into a large number of disjoint, fixed-size pieces (or file chunks) and assigns them to different cloud storage nodes (i.e., chunkservers). Each storage node (or node for short) then calculates the frequency of each unique word by scanning and parsing its local file chunks.

In such a distributed file system, the load of a node is typically proportional to the number of file chunks the node possesses. Because the files in a cloud can be arbitrarily created, deleted, and appended, and nodes can be upgraded, replaced, and added in the file system, the file chunks are not distributed as uniformly as possible among the nodes. Load balance among storage nodes is a critical function in clouds. In a load-balanced cloud, the resources can be well utilized and provisioned, maximizing the performance of MapReduce-based applications.

In this paper, we are interested in studying the load rebalancing problem in distributed file systems specialized for large-scale, dynamic, and data-intensive clouds. (The terms rebalance and balance are used interchangeably.) Such a large-scale cloud has hundreds or thousands of nodes (and may reach tens of thousands in the future). Our objective is to allocate the chunks of files as uniformly as possible among the nodes such that no node manages an excessive number of chunks. Additionally, we aim to reduce the network traffic (or movement cost) caused by rebalancing the loads of nodes as much as possible, to maximize the network bandwidth available to normal applications. Moreover, as failure is the norm, nodes are newly added to sustain the overall system performance, resulting in the heterogeneity of nodes. Exploiting capable nodes to improve the system performance is thus demanded.

II. LOAD REBALANCING ALGORITHM

The load rebalancing problem in distributed file systems specialized for large-scale, dynamic, and data-intensive clouds is to allocate the chunks of files as uniformly as possible among the nodes while reducing the network traffic (or movement cost) caused by rebalancing.
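To make the problem concrete, the following minimal sketch shows how deleting chunks and adding a node can leave chunk counts uneven. It is an illustration only: the round-robin placement, the node names, and the helper functions `place_chunks` and `node_loads` are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def place_chunks(num_chunks, nodes):
    """Assign chunks c0, c1, ... round-robin to nodes (hypothetical placement policy)."""
    return {f"c{i}": nodes[i % len(nodes)] for i in range(num_chunks)}


def node_loads(placement, nodes):
    """Load of a node = number of chunks it currently hosts."""
    counts = Counter(placement.values())
    return {node: counts.get(node, 0) for node in nodes}


nodes = ["n1", "n2", "n3"]
placement = place_chunks(9, nodes)
balanced = node_loads(placement, nodes)   # every node starts with 3 chunks

# A file consisting of chunks c0, c3, and c4 is deleted ...
for chunk in ("c0", "c3", "c4"):
    del placement[chunk]

# ... and a new, empty node joins the system.
nodes.append("n4")
skewed = node_loads(placement, nodes)

print(balanced)
print(skewed)
```

After the deletion and the node addition, the per-node chunk counts differ even though the initial placement was uniform, which is the situation a rebalancing algorithm has to repair.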
We consider a large-scale distributed file system consisting of a set of chunkservers $V$ in a cloud, where the cardinality of $V$ is $|V| = n$. Typically, $n$ can be 1,000, 10,000, or more. In the system, a number of files are stored in the $n$ chunkservers. First, let us denote the set of files as $F$. Each file $f \in F$ is partitioned into a number of disjoint, fixed-size chunks denoted by $C_f$. For example, each chunk has the same size, 64 Mbytes, in Hadoop HDFS [3]. Second, the load of a chunkserver is proportional to the number of chunks hosted by the server [3]. Third, node failure is the norm in such a distributed system, and the chunkservers may be upgraded, replaced, and added in the system. Finally, the files in $F$ may be arbitrarily created, deleted, and appended. The net effect is that file chunks are not uniformly distributed among the chunkservers.

Our objective in the current study is to design a load rebalancing algorithm that reallocates file chunks such that the chunks are distributed in the system as uniformly as possible while reducing the movement cost as much as possible. Here, the movement cost is defined as the number of chunks migrated to balance the loads of the chunkservers. Let $A$ be the ideal number of chunks that any chunkserver $i \in V$ is required to manage in a system-wide load-balanced state, that is, $A = \frac{\sum_{f \in F} |C_f|}{n}$.

A. Necessity of Load Rebalancing

A distributed load rebalancing algorithm of the kind studied here eliminates the dependence on central nodes; operates in a distributed manner, in which nodes perform their load-balancing tasks independently without synchronization or global knowledge regarding the system; and reduces the algorithmic overhead introduced to the DHTs as much as possible.
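The quantities defined at the start of this section, the ideal load A and the movement cost, can be sketched in a few lines. The greedy loop below is a deliberately simple, centralized illustration and is not the distributed algorithm this paper proposes; the function names `ideal_load` and `greedy_rebalance` and the sample loads are assumptions made for the example.

```python
def ideal_load(loads):
    """A = total number of chunks divided by the number of chunkservers."""
    return sum(loads.values()) / len(loads)


def greedy_rebalance(loads):
    """Illustrative centralized rebalancing (NOT the paper's distributed algorithm).

    Repeatedly migrates one chunk from the most loaded node to the least
    loaded node until their loads differ by at most one chunk.
    Returns the new loads and the movement cost (number of chunks migrated).
    """
    loads = dict(loads)          # work on a copy, leave the input untouched
    movement_cost = 0
    while True:
        heavy = max(loads, key=loads.get)
        light = min(loads, key=loads.get)
        if loads[heavy] - loads[light] <= 1:
            return loads, movement_cost
        loads[heavy] -= 1
        loads[light] += 1
        movement_cost += 1


# Hypothetical chunk counts for four chunkservers.
before = {"n1": 10, "n2": 2, "n3": 0, "n4": 0}
after, cost = greedy_rebalance(before)

print(ideal_load(before))   # ideal number of chunks per chunkserver
print(after, cost)
```

In this example the movement cost equals the number of chunks the overloaded node holds above A, which is the least that any reallocation reaching the balanced state could migrate.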

III. SYSTEM DESIGN

Figure 1 shows the system architecture: a small-scale cluster environment consisting of a single, dedicated name node and 25 data nodes. Specifically, the name node is equipped with an Intel Core 2 processor and 1 Gbyte of RAM. As the number of file chunks in our experimental environment is small, the RAM size of the name node is sufficient to cache the entire name node process and the metadata information, including the directories and the locations of file chunks. In the experimental environment of the proposed work, a number of clients are established to issue requests to the name node. The requests include commands to create directories with randomly designated names, to remove arbitrarily chosen directories, etc.

Figure 1. System Architecture (components: name node, data nodes, load balancer, network, clients)

A. Load Balancer

Fig. 2 shows the load balancer. A file is partitioned into a number of chunks allocated in distinct nodes so that MapReduce tasks can be performed in parallel over the nodes. For example, consider a word count application that counts the number of distinct words and the frequency of each unique word in a large file. In such an application, a cloud partitions the file into a large number of disjoint, fixed-size pieces (or file chunks) and assigns them to different cloud storage nodes (i.e., chunkservers).
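The word count example above can be sketched as follows. This is a single-process illustration under stated assumptions: chunks hold a fixed number of words rather than a fixed number of bytes (so that no word straddles a chunk boundary), placement is round-robin, and the function names are hypothetical. On a real cluster the per-node counting would run in parallel on the chunkservers.

```python
from collections import Counter


def split_into_chunks(words, words_per_chunk):
    """Partition the word list into disjoint, fixed-size chunks (the last may be shorter)."""
    return [words[i:i + words_per_chunk]
            for i in range(0, len(words), words_per_chunk)]


def assign_chunks(chunks, num_nodes):
    """Round-robin placement of chunks onto nodes (hypothetical policy)."""
    nodes = [[] for _ in range(num_nodes)]
    for index, chunk in enumerate(chunks):
        nodes[index % num_nodes].append(chunk)
    return nodes


def local_count(node_chunks):
    """Each node scans only its local chunks and counts word frequencies."""
    counts = Counter()
    for chunk in node_chunks:
        counts.update(chunk)
    return counts


def word_count(text, words_per_chunk=2, num_nodes=3):
    chunks = split_into_chunks(text.split(), words_per_chunk)
    nodes = assign_chunks(chunks, num_nodes)
    total = Counter()
    for node_chunks in nodes:            # sequential here, parallel on a cluster
        total += local_count(node_chunks)
    return total


result = word_count("the cloud stores the file in the cloud")
print(len(result), dict(result))         # number of distinct words and their frequencies
```

Because each node's work grows with the number of chunks it holds, an uneven placement makes the most loaded node the slowest to finish, which is why chunk balance matters for MapReduce-style jobs.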

Flowchart: see Annexure 1.

IV. CONCLUSION

A novel load-balancing algorithm to deal with the load rebalancing problem in large-scale, dynamic, and distributed file systems has been presented. The proposed system balances the loads of nodes and reduces the demanded movement cost as much as possible, while taking advantage of physical network locality and node heterogeneity. Our proposal is comparable to the centralized algorithm in the Hadoop HDFS production system and dramatically outperforms the competing distributed algorithm in terms of load imbalance factor, movement cost, and algorithmic overhead. In particular, our load-balancing algorithm exhibits a fast convergence rate.

V. REFERENCES

[1] H.-C. Hsiao, H.-Y. Chung, H. Shen, and Y.-C. Chao, Load Rebalancing for Distributed File Systems in Clouds, IEEE Trans. Parallel and Distributed Systems, vol. 24, no. 5, May 2013.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, The Google File System, Proc. 19th ACM Symp. Operating Systems Principles (SOSP '03), pp. 29-43, Oct. 2003.
[3] Hadoop Distributed File System, http://hadoop.apache.org/hdfs/, 2012.
[4] VMware, http://www.vmware.com/, 2012.
[5] Xen, http://www.xen.org/, 2012.
[6] Apache Hadoop, http://hadoop.apache.org/, 2012.
[7] Hadoop Distributed File System Rebalancing Blocks, http://developer.yahoo.com/hadoop/tutorial/module2.html#rebalancing, 2012.
[8] K. McKusick and S. Quinlan, GFS: Evolution on Fast-Forward, Comm. ACM, vol. 53, no. 3, pp. 42-49, Jan. 2010.
[9] HDFS Federation, http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarnsite/federation.html.
[10] I. Stoica, R. Morris, D. Liben-Nowell, D.R. Karger, M.F. Kaashoek, F. Dabek, and H. Balakrishnan, Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications,

Annexure 1.