SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE SYSTEM IN CLOUD



Similar documents
Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Design and Implementation of Performance Guaranteed Symmetric Load Balancing Algorithm

Balancing the Load to Reduce Network Traffic in Private Cloud

Distributed file system in cloud based on load rebalancing algorithm

IMPACT OF DISTRIBUTED SYSTEMS IN MANAGING CLOUD APPLICATION

Secured Load Rebalancing for Distributed Files System in Cloud

LONG TERM EVOLUTION WITH 5G USING MAPREDUCING TASK FOR DISTRIBUTED FILE SYSTEMS IN CLOUD

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

R.Tamilarasi #1, G.Kesavaraj *2

A Load Balancing Model Using a New Algorithm in Cloud Computing For Distributed File with Hybrid Security

International journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online

Enhance Load Rebalance Algorithm for Distributed File Systems in Clouds

LOAD BALANCING FOR OPTIMAL SHARING OF NETWORK BANDWIDTH

An Efficient Distributed Load Balancing For DHT-Based P2P Systems

Survey on Load Rebalancing for Distributed File System in Cloud

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

Load Re-Balancing for Distributed File. System with Replication Strategies in Cloud

LOAD BALANCING WITH PARTIAL KNOWLEDGE OF SYSTEM

Load Balancing in Structured Overlay Networks. Tallat M. Shafaat

Varalakshmi.T #1, Arul Murugan.R #2 # Department of Information Technology, Bannari Amman Institute of Technology, Sathyamangalam

Scalable Multiple NameNodes Hadoop Cloud Storage System

Load Balancing in Structured P2P Systems

Load Balancing in Structured Peer to Peer Systems

Load Balancing in Structured Peer to Peer Systems

New Structured P2P Network with Dynamic Load Balancing Scheme

AN EFFECTIVE PROPOSAL FOR SHARING OF DATA SERVICES FOR NETWORK APPLICATIONS

Locality-Aware Randomized Load Balancing Algorithms for DHT Networks

Efficient Metadata Management for Cloud Computing applications

INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY

Dynamo: Amazon s Highly Available Key-value Store

A PROXIMITY-AWARE INTEREST-CLUSTERED P2P FILE SHARING SYSTEM

Redistribution of Load in Cloud Using Improved Distributed Load Balancing Algorithm with Security

SCALABLE RANGE QUERY PROCESSING FOR LARGE-SCALE DISTRIBUTED DATABASE APPLICATIONS *

MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM

DISTRIBUTION OF DATA SERVICES FOR CORPORATE APPLICATIONS IN CLOUD SYSTEM

Survey on Load Rebalancing For Distributed File System in Cloud with Security

Secure and Privacy-Preserving Distributed File Systems on Load Rebalancing in Cloud Computing

Bounding Communication Cost in Dynamic Load Balancing of Distributed Hash Tables

CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING

Peer to Peer Networks A Review & Study on Load Balancing

Load Balancing in Dynamic Structured P2P Systems

Distributed File Systems

Object Request Reduction in Home Nodes and Load Balancing of Object Request in Hybrid Decentralized Web Caching

Snapshots in Hadoop Distributed File System

Seminar Presentation for ECE 658 Instructed by: Prof.Anura Jayasumana Distributed File Systems

Load Balancing in Peer-to-Peer Data Networks

Research on P2P-SIP based VoIP system enhanced by UPnP technology

Join and Leave in Peer-to-Peer Systems: The DASIS Approach

Krunal Patel Department of Information Technology A.D.I.T. Engineering College (G.T.U.) India. Fig. 1 P2P Network

A Brief Analysis on Architecture and Reliability of Cloud Based Data Storage

Achieving Resilient and Efficient Load Balancing in DHT-based P2P Systems

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Tornado: A Capability-Aware Peer-to-Peer Storage Network

An Efficient Deduplication File System for Virtual Machine in Cloud

How To Balance In Cloud Computing

Evaluating HDFS I/O Performance on Virtualized Systems

Introduction to Hadoop

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advance Research in Computer Science and Management Studies

SUITABLE ROUTING PATH FOR PEER TO PEER FILE TRANSFER

Loadbalancing and Maintaining the QoS On Cloud Computing

Storage Systems Autumn Chapter 6: Distributed Hash Tables and their Applications André Brinkmann

Migration of Virtual Machines for Better Performance in Cloud Computing Environment

A REVIEW ON EFFICIENT DATA ANALYSIS FRAMEWORK FOR INCREASING THROUGHPUT IN BIG DATA. Technology, Coimbatore. Engineering and Technology, Coimbatore.

Log Mining Based on Hadoop s Map and Reduce Technique

Adaptive Overlapped Data Chained Declustering In Distributed File Systems

Joining Cassandra. Luiz Fernando M. Schlindwein Computer Science Department University of Crete Heraklion, Greece

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Peer-VM: A Peer-to-Peer Network of Virtual Machines for Grid Computing

Cyber Forensic for Hadoop based Cloud System

New Algorithms for Load Balancing in Peer-to-Peer Systems

Distributed Data Stores

Clustering in Peer-to-Peer File Sharing Workloads

Parallel Computing. Benson Muite. benson.

NoSQL Data Base Basics

Introduction to Hadoop

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

International Research Journal of Interdisciplinary & Multidisciplinary Studies (IRJIMS)

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Scala Storage Scale-Out Clustered Storage White Paper

Fault Tolerance in Hadoop for Work Migration

Data Management Challenges in Cloud Computing Infrastructures

Chord - A Distributed Hash Table

Comparative analysis of Google File System and Hadoop Distributed File System

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

Jeffrey D. Ullman slides. MapReduce for data intensive computing

A NEW FULLY DECENTRALIZED SCALABLE PEER-TO-PEER GIS ARCHITECTURE

Optimizing and Balancing Load in Fully Distributed P2P File Sharing Systems

Transcription:

International Journal of Advances in Applied Science and Engineering (IJAEAS) ISSN (P): 2348-1811; ISSN (E): 2348-182X Vol-1, Iss.-3, JUNE 2014, 54-58 IIST SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE SYSTEM IN CLOUD 1 SHAIK SHAHEENA, 2 SD. AFZAL AHMAD, 3 DR.PRAVEEN SHAM 1 PG SCHOLAR,CSE(CN),QCET, NELLORE 2 ASSOCIATE PROFESSOR, CSE, QCET, NELLORE 3 PROFESSOR, CSE, GPREC, KURNOOL ABSTRACT:Distributed file systems are key building blocks for cloud computing applications based on the MapReduce programming paradigm. In such file systems, nodes simultaneously serve computing and storage functions; a file is partitioned into a number of chunks allocated in distinct nodes so that MapReduce tasks can be performed in parallel over the nodes. However, in a cloud computing environment, failure is the norm, and nodes may be upgraded, replaced, and added in the system. Files can also be dynamically created, deleted, and appended. This results in load imbalance in a distributed file system; that is, the file chunks are not distributed as uniformly as possible among the nodes. Emerging distributed file systems in production systems strongly depend on a central node for chunk reallocation. This dependence is clearly inadequate in a large-scale, failure-prone environment because the central load balancer is put under considerable workload that is linearly scaled with the system size, and may thus become the performance bottleneck and the single point of failure. In this paper, a fully distributed load rebalancing algorithm is presented to cope with the load imbalance problem. Our algorithm is compared against a centralized approach in a production system and a competing distributed solution presented in the literature. The simulation results indicate that our proposal is comparable with the existing centralized approach and considerably outperforms the prior distributed algorithm in terms of load imbalance factor, movement cost, and algorithmic overhead. The performance of our proposal implemented in the Hadoop distributed file system is further investigated in a cluster environment. 1. INTRODUCTION: Distributed computing is a field of computer science that Distributed computing also refers to the use of distributed studies distributed systems. A distributed system is a systems to solve computational problems. In distributed software system in which components located on computing, a problem is divided into many tasks, each of networked computers communicate and coordinate their which is solved by one or more computers, which actions by passing messages. The components interact communicate with each other by message passing. with each other in order to achieve a common goal. There are many alternatives for the message passing mechanism, including RPC-like connectors and message queues. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. An important goal and challenge of distributed systems is location transparency. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs. There are several autonomous computational entities, each of which has its own local memory. The word distributed in terms such as "distributed system", "distributed programming", and "distributed algorithm" originally referred to computer networks where individual computers were physically distributed within some geographical area. The terms are nowadays used in a much wider sense, even referring to autonomous processes that run on the same physical computer and interact with each other by message passing. While there is no single definition of a distributed system, the following defining properties are commonly used: The entities communicate with each other by message passing. 54

2. EXISTING SYSTEM: State-of-the-art distributed file systems (e.g., Google GFS and Hadoop HDFS) in clouds rely on central nodes to manage the metadata information of the file systems and to balance the loads of storage nodes based on that metadata. The centralized approach simplifies the design and implementation of a distributed file system. However, recent experience concludes that when the number of storage nodes, the number of files and the number of accesses to files increase linearly, the central nodes (e.g., the master in Google GFS) become a performance bottleneck, as they are unable to accommodate a large number of file accesses due to clients and MapReduce applications. 2.1 DISADVANTAGES OF EXISTING SYSTEM:The most existing solutions are designed without considering both movement cost and node heterogeneity and may introduce significant maintenance network traffic to the DHTs. 3. PROPOSED SYSTEM: In this paper, we are interested in studying the load rebalancing problem in distributed file systems specialized for large-scale, dynamic and data-intensive clouds. (The terms rebalance and balance are interchangeable in this paper.) Such a large-scale cloud has hundreds or thousands of nodes (and may reach tens of thousands in the future). Our objective is to allocate the chunks of files as uniformly as possible among the nodes such that no node manages an excessive number of chunks. Additionally, we aim to reduce network traffic (or movement cost) caused by rebalancing the loads of nodes as much as possible to maximize the network bandwidth available to normal applications. Moreover, as failure is the norm, nodes are newly added to sustain the overall system performance,resulting in the heterogeneity of nodes. Exploiting capable nodes to improve the system performance is, thus, demanded. Our proposal not only takes advantage of physical network locality in the reallocation of file chunks to reduce the movement cost but also exploits capable nodes to improve the overall system performance. 3.1 ADVANTAGES OF PROPOSED SYSTEM: This eliminates the dependence on central nodes. Our proposed algorithm operates in a distributed manner in which nodes perform their loadbalancing tasks independently without synchronization or global knowledge regarding the system. Algorithm reduces algorithmic overhead introduced to the DHTs as much as possible. 4.IMPLEMENTATION: 4.1 MODULES: Module creation DHT formulation Load balancing algorithm Replica Management 4.1.1 MODULES DESCRIPTION: Module Creation Our objective is to allocate the modules of files as uniformly as possible among the nodes such that no node 55

manages an excessive number of modules. A file is partitioned into a number of modules allocated in different nodes so that Map Reduce Tasks can be performed in parallel over the nodes. Because the files in a cloud can be dynamically created, deleted, and appended, and nodes can be upgraded, replaced and added in the file system, the file modules are distributed as uniformly as possible among the nodes. DHT Formulation The module servers in our proposal are organized as a DHT network. Typical DHTs guarantee that if a node leaves, then its locally hosted modules are reliably migrated to its successor; if a node joins, then it allocates the modules whose IDs immediately precede the joining node from its successor to manage. DHT s, given that a unique handle (or identifier) is assigned to each file module. The storage nodes are structured as a network based on distributed hash tables (DHTs), DHTs enable nodes to self-organize and repair while constantly offering lookup functionality in node dynamism, simplifying the system provision and management. Load Balancing Algorithm In our proposed algorithm, each module server node I first estimate whether it is under loaded (light) or overloaded (heavy) without global knowledge. A node is light if the number of modules it hosts is smaller than the threshold. First of all we will find the lightest node to take the set of modules from heaviest node. So we can do the process without failure. Load balancing is a technique to distribute workload across many computers or network to achieve maximum resource utilization, maximize throughput, minimize response time, and avoid overload. The load equalization service is sometimes provided by dedicated software package or hardware, like a multilayer switch or name server. Graph evaluation: The overall dynamic resource allocation in large cloud environment is evaluated and displayed as pie graph. The graph shows the overall usage of the virtual machines in the cloud. The graph shows the maximum utility of the Virtual machine under memory constraints. Equal usage of all virtual machines in the cloud is shown in the graph. 5. ARCHITECTURE: 5.1 DATA FLOW DIAGRAM: 1. The DFD is also called as bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of input data to the system, various processing carried out on this data, and the output data is generated by this system. 2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system. 3. DFD shows how the information moves through the system and how it is modified by a series of 56

transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output. 4. DFD is also known as bubble chart. A DFD may b e represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail. u s e d t o 6.CONCLUSION: A novel load-balancing algorithm to deal with the load rebalancing problem in large-scale, dynamic, and distributed file systems in clouds has been presented in this paper. Our proposal strives to balance the loads of nodes and reduce the demanded movement cost as much as possible, while taking advantage of physical network locality and node heterogeneity. In the absence of representative real workloads (i.e., the distributions of file chunks in a largescale storage system) in the public domain, we have investigated the performance of 57

our proposal and compared it against competing algorithms through synthesized probabilistic distributions of file chunks. The synthesis workloads stress test the load-balancing algorithms by creating a few storage nodes that are heavily loaded. The computer simulation results are encouraging, indicating that our proposed algorithm performs very well. Our proposal is comparable to the centralized algorithm in the Hadoop HDFS production system and dramatically outperforms the competing distributed algorithm in terms of load imbalance factor, movement cost, and algorithmic overhead. Particularly, our load-balancing algorithm exhibits a fast convergence rate. The efficiency and effectiveness of our design are further validated by analytical models and a real implementation with a small-scale cluster environment. REFERENCES [1] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proc. Sixth Symp. Operating System Design and Implementation (OSDI 04), pp. 137-150, Dec. 2004. [2] S. Ghemawat, H. Gobioff, and S.-T. Leung, The Google File System, Proc. 19th ACM Symp. Operating Systems Principles (SOSP 03), pp. 29-43, Oct. 2003. [8] K. McKusick and S. Quinlan, GFS: Evolution on Fast-Forward, Comm. ACM, vol. 53, no. 3, pp. 42-49, Jan. 2010. [9] HDFS Federation, http://hadoop.apache.org/common/docs/ r0.23.0/hadoop-yarn/hadoop-yarn-site/federation.html, 2012. [10] I. Stoica, R. Morris, D. Liben-Nowell, D.R. Karger, M.F. Kaashoek, F. Dabek, and H. Balakrishnan, Chord: A Scalable Peer-to- Peer Lookup Protocol for Internet Applications, IEEE/ACM Trans. Networking, vol. 11, no. 1, pp. 17-21, Feb. 2003. [11] A. Rowstron and P. Druschel, Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems, Proc. IFIP/ACM Int l Conf. Distributed Systems Platforms Heidelberg, pp. 161-172, Nov. 2001. [12] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, Dynamo: Amazon s Highly Available Key-Value Store, Proc. 21st ACM Symp. Operating Systems Principles (SOSP 07), pp. 205-220, Oct. 2007. [13] A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica, Load Balancing in Structured P2P Systems, Proc. Second Int l Workshop Peer-to-Peer Systems (IPTPS 02), pp. 68-79, Feb. 2003. [14] D. Karger and M. Ruhl, Simple Efficient Load Balancing Algorithms for Peer-to-Peer Systems, Proc. 16th ACM Symp. Parallel Algorithms and Architectures (SPAA 04), pp. 36-43, June 2004. [15] P. Ganesan, M. Bawa, and H. Garcia-Molina, Online Balancing of Range-Partitioned Data with Applications to Peer-to-Peer Systems, Proc. 13th Int l Conf. Very Large Data Bases (VLDB 04), pp. 444-455, Sept. 2004. [16] J.W. Byers, J. Considine, and M. Mitzenmacher, Simple Load Balancing for Distributed Hash Tables, Proc. First Int l Workshop Peer-to-Peer Systems (IPTPS 03), pp. 80-87, Feb. 2003. [17] G.S. Manku, Balanced Binary Trees for ID Management and Load Balance in Distributed Hash Tables, Proc. 23rd ACM Symp. Principles Distributed Computing (PODC 04), pp. 197-205, July 2004. [18] A. Bharambe, M. Agrawal, and S. Seshan, Mercury: Supporting Scalable Multi-Attribute Range Queries, Proc. ACM SIGCOMM 04, pp. 353-366, Aug. 2004. [3] Hadoop Distributed File System, http://hadoop.apache.org/hdfs/, 2012. [4] VMware, http://www.vmware.com/, 2012. [5] Xen, http://www.xen.org/, 2012. [6] Apache Hadoop, http://hadoop.apache.org/, 2012. [7] Hadoop Distributed File System Rebalancing Blocks, http://developer.yahoo.com/hadoop/tutorial/module2.html#rebalancing, 2012. 58