Hadoop Distributed FileSystem on Cloud




Giannis Kitsos, Antonis Papaioannou and Nikos Tsikoudis (authors listed in alphabetical order)
Department of Computer Science, University of Crete
{kitsos, papaioan, tsikudis}@csd.uoc.gr

Abstract. The growing capability of cloud computing motivates developers to build their applications on cloud infrastructures that offer high performance and scalability. We propose a modified version of the Hadoop Distributed File System (HDFS) that runs efficiently in a cloud environment: it replicates data blocks across different physical machines, not just among different virtual machines. In this way HDFS provides better reliability and availability of data. The choice of each replication target is based on the mapping between physical and virtual machines provided by the cloud provider. To export this mapping to HDFS, we implemented a web service that bridges HDFS and the cloud provider.

1 Introduction

In recent years there has been a trend for enterprises to use cloud computing infrastructures, which give users the illusion of unlimited computational resources as well as large pools of data. As a result, cloud computing offers developers the opportunity to build extremely scalable applications, and many software frameworks originally built to run on clusters can also be used in a cloud computing environment.

Hadoop [1,8,9] is a software framework that supports data-intensive distributed applications running on a cluster as well as on a cloud. The Hadoop Distributed File System (HDFS) [7] is the file system component of Hadoop, responsible for distributing replicas of application data across the nodes of the infrastructure that hosts the application. In a cloud environment a physical machine can host multiple virtual machines (VMs), and HDFS is aware only of the VMs, not of the physical machines that host them. Consequently there is always the possibility that all the VMs chosen to store the replicas of a data block reside on the same physical machine.

In this report we present a modification of the HDFS replication strategy that stores replicas of the same data block on VMs hosted by different physical machines of the cloud infrastructure.

For that purpose the cloud provider has to export information about the correlation between the virtual and physical machines of the cloud system. To build this bridge between the cloud provider and HDFS, we implemented a web service that could be offered by any cloud provider. Finally, we enumerate possible shortcomings and present solutions for them.

2 Background

Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data by providing a distributed file system. A Hadoop cluster scales storage capacity and I/O bandwidth by simply adding commodity servers. Hadoop is a top-level Apache project built and used by a global community of contributors [2], and it is written in the Java programming language. Yahoo! has been the largest contributor to the project [4] and uses Hadoop extensively across its businesses [5].

HDFS is the file system component of Hadoop, whose interface is patterned after the UNIX file system. HDFS stores file system metadata and application data separately. A dedicated server, called the NameNode, stores the metadata of the file system. NameNode records include attributes such as permissions, modification and access times, and namespace and disk space quotas. The servers where the application data are stored are called DataNodes. The file system uses the TCP/IP layer for communication between servers.

HDFS is designed to handle very large files. The contents of a file are split into large blocks, and each block is independently replicated to multiple DataNodes. Each data block is represented by two files on a DataNode's file system: the first contains the actual data, the other contains the block's metadata such as its checksum and generation stamp. To achieve data reliability and durability, HDFS replicates the content of files across multiple DataNodes following a specific storage and replication policy. This strategy has the added advantage that data transfer bandwidth is multiplied and there are more opportunities for locating computation near the needed data.

User applications can read, write and delete files as well as create and delete directories using the HDFS client, a code library that exports an interface allowing user applications to access the file system. When an application wants to read a file, the HDFS client asks the NameNode for the list of DataNodes that hold replicas of the file's blocks. The client then chooses one of the DataNodes and fetches the desired block by contacting it directly. Similarly, when a client wants to write a data block, it contacts the NameNode to get a list of DataNodes that will store replicas of the block. The client then organizes a pipeline and sends the data to be written to all the chosen DataNodes. The user application generally does not need to know that the file system metadata and storage are on different servers, or that blocks have multiple replicas.
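These NameNode and DataNode interactions are hidden behind the HDFS client library. The following minimal Java sketch shows how an application reads and writes a file through the standard Hadoop FileSystem API; the NameNode address hdfs://namenode:9000 and the file path are illustrative placeholders, not part of the system described here.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative NameNode address; depends on the actual deployment.
            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf);

            // Write: the client obtains target DataNodes from the NameNode and
            // streams the data through the replication pipeline transparently.
            Path file = new Path("/user/demo/sample.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello HDFS");
            }

            // Read: the client obtains block locations from the NameNode and
            // then contacts one of the DataNodes holding a replica directly.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }

            fs.close();
        }
    }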

3 Challenges

Exploiting the cloud computing features that many enterprises provide, cloudified deployments of many services become extremely powerful and flexible. Building the Hadoop Distributed File System on a cloud therefore makes it able to deliver scalable storage with high-performance data processing. However, despite the obvious benefits of cloudifying such a service, several challenges have to be addressed.

Taking advantage of the benefits offered by a software platform that implements cloud computing, we present a more portable, flexible and, above all, more scalable HDFS deployment. For our implementation we use Eucalyptus [3][6], an open-source software platform that offers high-performance cloud solutions. Using Eucalyptus we create a cluster of virtual machines that constitute the DataNodes of the file system, where all data block replicas are stored. Such a cluster can scale by creating and removing virtual machines (VMs) as needed, based on the stored data load. According to the replication policy followed by the HDFS NameNode, the blocks that make up a file are replicated among different machines in order to maintain data reliability and availability despite possible DataNode failures.

To build a virtual computing cluster we use a few physical computers, each of which can run many virtual machines. Because each physical machine can host tens of VMs, a serious problem arises when the file system runs on such a cluster: the NameNode's replication policy replicates data blocks among different virtual machines, but these VMs may reside on the same physical machine. As a result, if that physical machine fails, all replicas of a block are lost, which harms data reliability and availability. Our goal is to achieve a more efficient distribution of file block replicas across different physical machines, not just among different VMs, avoiding the loss of all replicas of a block.

The basic challenge in achieving our goal is gathering information about the mapping of virtual machines to physical machines. This information should be exported by the cloud provider, and by modifying the HDFS replication policy we can store replicas of the same data block on different physical machines. An important decision is how the mapping information is transferred from the cloud provider to the HDFS NameNode. This requirement may raise security and privacy issues on the cloud provider's side. Such issues can be overcome by implementing a web service at the cloud provider side that exports only the information about which virtual machines run on the same physical machine. In Fig. 1 we present an overview of HDFS where the NameNode, by calling the web service, learns which VMs to exclude at each file block placement. If a block is stored on a specific virtual machine, the NameNode automatically excludes the VMs that belong to the same physical machine as the selected one from the subsequent replica placements. More information about the web service is discussed in Sections 4 and 5.

Fig. 1. Overview of the Hadoop Distributed File System on a Cloud

4 Implementation

HDFS consists of the NameNode and the DataNodes. The NameNode serves as the namespace manager: it maintains the namespace and the mapping of file blocks to DataNodes, and it is responsible for applying the block placement and replication policy. The DataNodes are the nodes where the file blocks are stored, following the policy dictated by the NameNode. The replicas of each data block are stored on different DataNodes, and each DataNode corresponds to a different virtual machine of the cloud. HDFS holds no information about the VMs except for statistics about the data blocks they host. As a result, HDFS does not know whether the chosen DataNodes correspond to the same physical machine, and consequently all the replicas may end up on the same physical machine.

The cloud provider, on the other hand, holds all the necessary information about the mapping between virtual and physical machines. In our setup we use Eucalyptus, which hosts the virtual machines used by HDFS. Eucalyptus keeps log files that contain information about every VM, including the virtual IP address of the VM, the IP address of the physical machine that hosts it, and the owner of the VM, among others. In more detail, every 20 seconds Eucalyptus updates the file cloudoutput.log, which contains information about the VMs running at that time. To gather this information we built a log parser, written in Java, that scans the log file and creates the mapping between the VMs in use and the physical machines.
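The exact layout of the Eucalyptus log entries depends on the installation, so the following is only a minimal sketch of the log parser idea under assumed field names (ownerId, pubIp, hostIp); it is not the actual log format or the authors' parser, but it illustrates how VM IP addresses can be grouped by the physical host that appears on the same log line.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch only: the line format and field names below are assumptions.
    public class CloudLogParser {
        private static final Pattern VM_LINE = Pattern.compile(
                "ownerId=(\\S+).*pubIp=(\\S+).*hostIp=(\\S+)");

        // Returns: physical-host IP -> list of VM IPs owned by the given owner.
        public static Map<String, List<String>> parse(String logPath, String ownerId)
                throws IOException {
            Map<String, List<String>> hostToVms = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get(logPath))) {
                Matcher m = VM_LINE.matcher(line);
                if (m.find() && m.group(1).equals(ownerId)) {
                    hostToVms.computeIfAbsent(m.group(3), k -> new ArrayList<>())
                             .add(m.group(2));
                }
            }
            return hostToVms;
        }
    }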

Fig. 2. IP addresses grouped according to the physical machines they belong to

HDFS needs to obtain this correlation between VMs and physical machines. For that purpose we have implemented a SOAP-based web service that bridges the Eucalyptus cloud provider and HDFS. The web service runs at the cloud provider side, because only the provider keeps track of the VMs and the physical machines, and a web service client used by the NameNode consumes it. In this way we export the necessary information from the cloud provider to HDFS.

In more detail, every VM running in Eucalyptus has an owner ID. The client sends to the web service the owner ID of the VMs that are part of the HDFS deployment and host DataNodes. The web service, using the log parser, filters the logs and builds the correlation between the VMs with that specific owner ID and the physical machines. We use the owner ID because HDFS, and generally any cloud user, does not need to know anything about the VMs that belong to other cloud users. The mapping is based on the IP addresses of the VMs and the physical machines. Moreover, the cloud provider may not want to reveal sensitive information such as the IP addresses of the physical machines; for this reason we hide the physical machines' IP addresses and return the IP addresses of the VMs grouped according to the physical machine they belong to. This grouping is illustrated in Fig. 2. The mapping is represented as an ArrayList whose buckets are themselves ArrayLists: each inner list contains the IP addresses of a group of VMs that belong to the same physical machine. The mapping created by the web service is sent to the client as an encoded string. Once the client receives this information, it reconstructs the same data structure, which the NameNode then uses to distribute the replicas of a data block to DataNodes that belong to different physical machines.
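The report does not specify the exact string encoding, so the sketch below assumes a simple delimiter scheme (';' between physical-machine groups, ',' between VM IPs within a group) purely for illustration; any unambiguous encoding would serve the same purpose.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of one possible encoding of the VM grouping; the delimiters are
    // assumptions, not the encoding actually used by the web service.
    public class VmGroupingCodec {

        // Server side: flatten the ArrayList-of-ArrayLists into a single string.
        public static String encode(List<List<String>> groups) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < groups.size(); i++) {
                if (i > 0) sb.append(';');
                sb.append(String.join(",", groups.get(i)));
            }
            return sb.toString();
        }

        // Client side: rebuild the same structure from the encoded string.
        public static List<List<String>> decode(String encoded) {
            List<List<String>> groups = new ArrayList<>();
            if (encoded.isEmpty()) return groups;
            for (String group : encoded.split(";")) {
                List<String> vms = new ArrayList<>();
                for (String ip : group.split(",")) {
                    vms.add(ip);
                }
                groups.add(vms);
            }
            return groups;
        }
    }

For example, two physical machines, one hosting VMs 10.0.0.1 and 10.0.0.2 and the other hosting 10.0.0.3, would be encoded as "10.0.0.1,10.0.0.2;10.0.0.3" and decoded back into the same two groups on the NameNode side.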

The basic component of the NameNode that performs most of the file system management is the FSNamesystem. When a file block needs to be stored or replicated, the FSNamesystem delegates to the Replication Target Chooser the task of applying the replica placement strategy. The Replication Target Chooser is responsible for choosing the desired number of targets for placing the block replicas. A block needs to be replicated and stored either because it originates from a new file of the file system or because it has been marked as under-replicated. There are several reasons for a block to be marked as under-replicated: after a DataNode failure, the blocks that had been replicated there are placed in the under-replicated queue; a replica may be found corrupted; or the file system administrator may decommission a DataNode, in which case all the blocks replicated on it are marked as under-replicated. Under these circumstances the Replication Target Chooser selects the desired targets based on the replication factor and policy and returns a list of DataNodes to the NameNode, which then starts storing the replicas.

In our implementation, to achieve a reliable Hadoop Distributed File System built on a cloud, we modified the NameNode to maintain an extra data structure for the correlation between the virtual machines and the physical machines. The data structure is a table in which each index points to a list containing all the virtual machines that belong to the same physical machine. A web service client is responsible for storing this information: it calls the web service as described above and updates the structure, informing the NameNode. For each block that needs replication, the Replication Target Chooser runs until it finds all the desired targets according to the replication factor. Each time it starts to look for a new target for a block replica, the Chooser looks up in the data structure the virtual machines that belong to the same physical machine as the targets already chosen for that block; if there are any, the Chooser excludes these virtual machines from the subsequent choices. As a result, all the replicas of the same block are stored on different physical machines. Only when the replication factor exceeds the number of available physical machines can replicas of the same block end up on the same physical machine.
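The exclusion step can be summarized by the following minimal Java sketch. The class and method names are hypothetical and the logic works directly on the ArrayList-of-ArrayLists grouping described above; it is an illustration of the placement rule, not the actual modified Replication Target Chooser code.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch of physical-machine-aware target selection.
    public class PhysicalMachineAwareChooser {
        private final List<List<String>> vmGroups; // VMs grouped per physical machine

        public PhysicalMachineAwareChooser(List<List<String>> vmGroups) {
            this.vmGroups = vmGroups;
        }

        // Every VM that shares a physical machine with an already-chosen target
        // is excluded, so the next replica lands on a different physical machine.
        public Set<String> excludedFor(List<String> chosenTargets) {
            Set<String> excluded = new HashSet<>();
            for (List<String> group : vmGroups) {
                for (String chosen : chosenTargets) {
                    if (group.contains(chosen)) {
                        excluded.addAll(group);
                        break;
                    }
                }
            }
            return excluded;
        }

        // Pick targets one by one, skipping excluded VMs; falls back to reusing
        // physical machines only when there are fewer of them than replicas.
        public List<String> chooseTargets(List<String> candidateVms, int replication) {
            List<String> targets = new ArrayList<>();
            for (String vm : candidateVms) {
                if (targets.size() == replication) break;
                if (!excludedFor(targets).contains(vm)) {
                    targets.add(vm);
                }
            }
            for (String vm : candidateVms) { // fallback: not enough physical machines
                if (targets.size() == replication) break;
                if (!targets.contains(vm)) targets.add(vm);
            }
            return targets;
        }
    }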

5 Shortcomings

Despite the advantages of the powerful, flexible and scalable deployment we present, we also enumerate a list of possible shortcomings. These mainly concern the performance of the interaction between HDFS and the web service, and the complexity of maintaining and searching the data structure that holds the mappings between VMs and physical machines. We evaluated our approach on a cluster of only four VMs, where the shortcomings are not visible, but they would become quite obvious with a huge number of VMs and physical machines.

In more detail, as we presented in Section 4, we use a SOAP-based web service to export the mapping of VMs to physical machines from the cloud provider to the client called by the NameNode. The returned result is an encoded String. After the client gets the result, it restructures the content into an Arraylist<Arraylist<String>>. The complexity of this restructuring is O(N), where N is the number of VMs, so with a huge number of VMs it increases the time needed to create replicas. Another complexity issue arises every time the client wants to find, through that data structure, the VMs that belong to the same physical machine: this search is also O(N), so with a huge number of VMs the time to create replicas grows further. The last issue is that the client calls the cloud provider every time it wants to create new replicas of a data block; combined with the above shortcomings, this can result in much lower performance of the Replication Target Chooser. A better approach would be for the client to keep creating replicas based on its existing data structure of correlations between VMs and physical machines until the cloud provider informs it that something in the network topology has changed and the structure needs updating. In Section 6 we present some possible solutions that mitigate part of these drawbacks.

6 Future Work

In this section we discuss future approaches that would improve our design by eliminating the shortcomings enumerated in Section 5. Furthermore, we present the idea of modifying Hadoop's Balancer to work on the cloud.

First of all, an important issue is to change the existing web service, which is based on the SOAP model, to another model in order to avoid the cost of restructuring data at the client's side. This could make replication much faster than our existing approach. As mentioned before, another problem is the cost of searching our structure for the VMs that belong to the same physical machine. A more efficient data structure could avoid this cost: for example, instead of the presented structure, an Arraylist<Arraylist<MyClass>>, where MyClass contains a VM's IP address and the index of the Arraylist in which the address is placed. With this kind of approach we would be able to find the VMs that belong to the same physical machine with complexity O(1).

Hadoop's Balancer is an administrative tool that balances [ref] disk space usage on an HDFS cluster. The tool is deployed as an application program that can be run by the cluster administrator. It iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization, and it guarantees that its decisions do not violate the storage and replication strategy. Our approach will focus on using the information about the correlations between VMs and physical machines in order to move replicas to VMs that do not belong to the same physical machine. This modification would allow the Balancer to work more efficiently on the cloud.
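One way to realize the constant-time lookup sketched above is to keep, next to the grouping, a hash index from each VM IP address to the position of its group. The following Java sketch illustrates that idea under a hypothetical class name; it is not the authors' implementation.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: a hash index from VM IP to its group position gives
    // O(1) expected-time lookup of all VMs sharing a physical machine.
    public class VmGroupIndex {
        private final List<List<String>> groups;          // VMs per physical machine
        private final Map<String, Integer> ipToGroup = new HashMap<>();

        public VmGroupIndex(List<List<String>> groups) {
            this.groups = groups;
            for (int i = 0; i < groups.size(); i++) {
                for (String ip : groups.get(i)) {
                    ipToGroup.put(ip, i);                  // built once, O(N) total
                }
            }
        }

        // All VMs on the same physical machine as the given VM.
        public List<String> sameHostAs(String vmIp) {
            Integer idx = ipToGroup.get(vmIp);
            return idx == null ? new ArrayList<>() : groups.get(idx);
        }
    }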

7 Conclusions

In this report we present an extension of the Hadoop Distributed File System that can be deployed more efficiently on a cloud. The core idea, in order to make it more scalable, powerful and flexible, is a wider distribution of the data replicas across the machines that constitute a virtual cluster on a cloud. Each physical machine of the cloud can run tens of virtual machines. In our approach we distribute the data blocks across the physical machines and not just across the virtual machines; in the latter case it is possible for all the replicas of a block to be stored on different VMs but on the same physical machine, and a failure of that machine results in the loss of all replicas. To apply this distribution, information about the correlation between the VMs and the physical machines is required. This information must be exported by the cloud provider, so we also present a solution based on a web service that gathers the information and sends it to the NameNode of HDFS.

References

1. Apache Hadoop. http://hadoop.apache.org/.
2. Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/poweredby.
3. Eucalyptus Systems. http://www.eucalyptus.com/.
4. Hadoop credits page. http://hadoop.apache.org/core/credits.html.
5. Yahoo! launches world's largest Hadoop production application. http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html.
6. D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus open-source cloud-computing system. In Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID '09), pages 124-131, Washington, DC, USA, 2009. IEEE Computer Society.
7. K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pages 1-10, Washington, DC, USA, 2010. IEEE Computer Society.
8. J. Venner. Pro Hadoop. Apress, 2009.
9. T. White. Hadoop: The Definitive Guide. O'Reilly Media, Yahoo! Press, 2009.