Integration of Virtualization with Hadoop Tools

Aparna Raj K (aparnaraj.k@iiitb.org)
Kamaldeep Kaur (Kamaldeep.Kaur@iiitb.org)
Uddipan Dutta (Uddipan.Dutta@iiitb.org)
V Venkat Sandeep (Sandeep.VV@iiitb.org)

Technical Report IIITB-TR-2012-006
April 2012
Abstract

Processing large amounts of data in very little time is a major requirement of the present-day world. The Hadoop Distributed File System (HDFS) [1] and MapReduce [1] are two major components contributing to this task. Even though MapReduce can process large amounts of data, the focus is always on obtaining maximum performance with optimal resources. The main goal of this project is to integrate the Hadoop MapReduce framework with virtualization to achieve better resource utilization. We present a virtualized setup of a Hadoop cluster that provides greater computing capacity with fewer resources, since a virtualized cluster requires fewer physical machines. The master node of the cluster is set up on a physical machine, and slave nodes are set up on virtual machines (VMs) that may share a common physical machine. Hadoop-configured VM images are created by cloning VMs, which facilitates fast addition and deletion of nodes in the cluster without much overhead. We have also configured the Hadoop virtual cluster to use the capacity scheduler instead of the default FIFO scheduler. The capacity scheduler checks the availability of RAM and virtual memory (VMEM) in slave nodes before allocating any job. So instead of queuing up, jobs are efficiently allocated to VMs based on the memory available. Various configuration parameters of Hadoop are analyzed, and the virtual cluster is fine-tuned to ensure best performance and maximum scalability. Project source can be downloaded from the following URL: https://sourceforge.net/p/hadoop

© 2012 Aparna Raj Koroth, Kamaldeep Kaur Randhawa, Uddipan Dutta, Velugoti Venkat Sandeep. This material is available under the Creative Commons Attribution-Noncommercial-Share Alike License. See http://creativecommons.org/licenses/by-nc-sa/3.0/ for details.
Acknowledgement

We are very thankful to the people who have helped and supported us during this project. We express our deepest thanks to our advisor, Prof. Shrisha Rao, for his encouraging words and useful suggestions throughout this work. It was a pleasure for us to work under his guidance. We would also like to thank Vinaya MS, an alumna of our college, for her valuable suggestions.
Contents

1 Introduction 6
2 System Design 7
2.1 Architecture 8
2.2 Virtualization 9
3 Implementation Details 10
3.1 Hadoop Configuration 10
3.2 VM Installation and Configuration 14
3.3 Setting Up Virtual Hadoop Cluster 15
4 Performance Metrics 17
5 Conclusion 23
References 24
List of Tables

1 Real vs Virtual Cluster Runtime 18
2 Input Size vs Time Taken 19
3 Cluster Size vs Time Taken 20

List of Figures

1 Modified Architecture of Virtual Hadoop Cluster 8
2 Graph for Real vs Virtual Cluster Runtime 18
3 Graph for Input Size vs Time Taken 19
4 Graph for Cluster Size vs Time Taken 20
5 Graph for analysis of MapReduce jobs 21
1 Introduction

With significant advances in information technology, the amount of data being processed in the computing world is increasing day by day. As a result, the need for a new kind of infrastructure and system that can handle large amounts of data is in high demand. Google designed a programming model for cloud computing, called MapReduce [1], as a possible solution for managing large data sets. Later, Hadoop, from the Apache project [1], provided a popular open-source implementation of MapReduce. MapReduce is a software framework for processing huge volumes of data. It runs on big clusters of compute nodes, which process the data in parallel. As the amount of data processed is huge, a lot of resources have to be quickly coordinated and allocated. Though it is true that Hadoop gives good results on physical machines, service providers find it difficult to manage incoming requests quickly. Virtualization can solve such management problems for Hadoop, and this project aims at enhancing the Hadoop environment with quick resource-allocation capabilities.

Objective

The major goal of the project is the integration of Hadoop tools with virtualization using the Hadoop MapReduce framework. The project aims at building a virtualized environment for Hadoop version 0.20.2 such that maximum performance is obtained with optimal resources. It also focuses on understanding the most suitable configuration of Hadoop for best performance in a virtualized environment.

Description

Hadoop MapReduce computes large amounts of data in a reasonable time, due to its high scalability. This is achieved by dividing the job into small tasks that can be spread across a large collection of computers. As part of the MapReduce implementation, Hadoop has an internal scheduler for managing these incoming requests. The capabilities of the scheduler are enhanced by configuring a new scheduler, called the capacity scheduler [2].
This scheduler is better suited than the default Hadoop scheduler for memory management in a virtualized Hadoop cluster. In the event of an unexpectedly high amount of data, more VMs can be added to the cluster via cloning or through normal creation of a VM.
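The split-compute-combine idea behind MapReduce can be illustrated with a small single-machine analogy. The following shell pipeline is not Hadoop code; it merely mirrors the map (emit one word per line), shuffle (sort groups identical keys together), and reduce (count each group) phases of the classic word-count example, with file names chosen for illustration:

```shell
# "Map": split text into one word per line.
# "Shuffle": sort brings identical words together.
# "Reduce": uniq -c counts each group.
printf 'hadoop runs map and reduce tasks\nmap tasks run in parallel\n' > /tmp/sample.txt

tr -s ' ' '\n' < /tmp/sample.txt | sort | uniq -c | sort -rn > /tmp/counts.txt
```

In Hadoop, each phase runs as distributed tasks over HDFS blocks instead of local pipes, which is what lets the same computation scale across a cluster.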
Gap Analysis

Though Hadoop MapReduce works well on physical systems, quick provisioning of resources is a problem. Setting up and configuring Hadoop on a new physical machine and adding it as a node to an existing Hadoop cluster is time consuming. But with virtualization, a new node configured with Hadoop can be easily created by cloning an already configured VM.

i. As the computations involved are large, a lot of resources should be quickly coordinated and properly allocated. If there are a large number of requests at a particular time, and the Hadoop environment requires more resources to process them, adding and configuring a new physical machine would take a considerable amount of time.

ii. For a large job, adding more TaskTrackers [1] to the cluster will help in faster computation, but there is no flexibility in adding or removing nodes from a Hadoop cluster set up entirely on physical machines.

iii. The JobTracker [1], which acts as the master node in the cluster, is a single point of failure.

The project aims at enhancing the Hadoop environment with a more virtualized infrastructure so that the Hadoop cluster can be easily extended to a larger cluster. If more nodes are required to finish the jobs in a cluster, a VM can be cloned to produce a new node. The clone is created from a predefined machine image with the required Hadoop configuration and software. As it is already configured according to the requirements, it can be added to the Hadoop cluster easily. This solves the problem of quick resource provisioning on Hadoop. With multiple VMs running, the overall utilization of the clusters is improved.

2 System Design

The basic components involved are:

i. The MapReduce framework.
ii. HDFS [1], which provides the data storage required by the MapReduce framework.
iii. Oracle VM VirtualBox [3], which manages the creation, cloning and maintenance of VMs.
2.1 Architecture

To analyze the performance of Hadoop clusters with virtualization, clusters were designed with virtual nodes. A virtual Hadoop cluster was set up with the master node on a physical machine and slave nodes on VMs. The architecture of the system is depicted in Figure 1.

Figure 1: Modified Architecture of Virtual Hadoop Cluster

Figure 1 shows a virtual Hadoop cluster on a single host machine. The master node of the cluster is the host machine, i.e., the physical machine. The Hadoop daemons NameNode and JobTracker run on the master node. The other slave nodes required to form the cluster are set up on VMs. Multiple VMs are set up as slave nodes, on which the Hadoop daemons DataNode and TaskTracker [1] run. A new slave node can easily be added by cloning an already configured VM.

MapReduce Framework

MapReduce is a programming model that processes large amounts of data, such as web logs, crawled documents, etc. Though the computations involved in MapReduce are more or less straightforward, the input can be very large. If it is distributed across a large number of machines, the big computations can be finished within a reasonable amount of time. The nodes in the cluster are categorized as:

i. Master node or JobTracker.
ii. Slave node or TaskTracker.

The master node, or JobTracker, receives input, splits the job into smaller tasks, and assigns each split to slave nodes, or TaskTrackers, for computation.

HDFS - The data storage for MapReduce

Hadoop has a distributed file system within it, the HDFS. Computation and data storage are done on the same machines by making use of this file system. The use of separate dedicated storage is thus avoided. HDFS follows the client-server architecture, just like MapReduce. The master node is called the NameNode [1] and the slave nodes are termed DataNodes [1]. The NameNode stores and coordinates all the meta-information of the file system. Each node in the cluster is a DataNode.

Capacity Scheduler

The capacity scheduler [2] is a new scheduler for Hadoop. By using this scheduler, it is possible to reduce the execution time of jobs. This is achieved by taking the memory capacity of the nodes into consideration while scheduling the jobs. It also helps in achieving improved throughput, and the utilization of the cluster can be increased. Jobs are submitted to the multiple queues of the capacity scheduler. The jobs allocated and the fraction of the capacity allotted are balanced by the scheduler to maintain a uniform allocation. The queues support job priorities as well. Each queue enforces an upper limit on the percentage of resource allocation per user. This prevents a few jobs from dominating the resources. Memory-intensive jobs are well supported by this scheduler. Jobs can specify higher memory requirements if necessary, and such jobs will be run only on TaskTrackers with more spare memory.

2.2 Virtualization

Though Hadoop works on physical machines and provides good results, it is difficult to quickly add a new node to the cluster. Virtualization [4] can solve such problems for Hadoop, as it does for other software solutions. By virtualizing the nodes in the Hadoop cluster, better resource utilization can be ensured.
It also facilitates quick addition and deletion of nodes without much overhead. Partitioning each node into a number of virtual machines (VMs) gives us a number of benefits:
i. Scheduling: With VMs, hardware utilization can be increased. When more computing capacity is required for scheduling batch jobs, the unused capacity of the hardware can be exploited with the help of virtualization.

ii. Resource Utilization: Different kinds of VMs can be hosted on the same physical machine. Hadoop VMs can be used along with VMs for other tasks. This results in better resource utilization by consolidating different kinds of applications.

iii. Datacenter Efficiency: A wider variety of tasks can be run on a virtualized infrastructure, as it is possible to run cross-platform applications as well.

iv. Deployment: Deployment time for new nodes in a Hadoop cluster can be greatly reduced by virtualization. Configuring Hadoop on a machine can be done quickly by cloning an already configured VM.

By creating a Hadoop cluster of VMs, better resource utilization can be achieved. The Hadoop environment is modified to use the capacity scheduler. Its capability to schedule tasks based on a job's memory requirements, in terms of RAM and virtual memory on a slave node, makes it better suited to a virtual Hadoop cluster.

3 Implementation Details

Setting up a virtual Hadoop cluster involves many tasks. They are listed in this section, with the required configurations and settings. After configuring Hadoop on VMs, the virtual Hadoop cluster is set up with a physical machine as the master node and VMs as slave nodes. The scheduling of jobs in the virtual cluster is enhanced by configuring the capacity scheduler on the master node.

3.1 Hadoop Configuration

Hadoop version 0.20.2 was installed and a cluster of multiple nodes was set up. The pre-requisites [5] for this are mentioned below.

Pre-requisites:

Java 1.6, with the JAVA_HOME environment variable set to the Java installation location
Secure Shell (SSH) server and client, with an SSH key generated for each machine in the cluster

Adding a dedicated Hadoop system user: A dedicated user account [5] was set up for running Hadoop. This helps to separate the Hadoop installation from other software applications and user accounts running on the same machine. It also provides better security.

SSH Access: Hadoop requires SSH access to manage its nodes, as the nodes are on different systems or VMs. For our multi-node setup of Hadoop, SSH access was set up between the dedicated user accounts created on the machines. It was then configured to allow SSH public-key authentication. It is necessary to generate an SSH key [5] for each machine, since the scripts that start Hadoop need to be able to log into the other machines in the cluster without entering a password. This can be done with the following commands:

ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Typically, one machine in the cluster is designated exclusively as the NameNode and another machine as the JobTracker. These are the master nodes of the cluster. The other machines in the cluster act as both DataNode and TaskTracker; these are the slave nodes. The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster have the same HADOOP_HOME path. Hadoop configuration is driven by the following site-specific configuration files [5]:

conf/core-site.xml
conf/hdfs-site.xml
conf/mapred-site.xml

Site Configuration: The Hadoop daemons, NameNode/DataNode and JobTracker/TaskTracker, are configured to run MapReduce jobs on the cluster.
In conf/core-site.xml: The Uniform Resource Identifier (URI) of the NameNode is specified in this file.

core-site.xml [5]

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system.</description>
</property>

Configuring HDFS: In conf/hdfs-site.xml: The path on the local file system where the NameNode persistently stores the namespace and transaction logs is specified in this file. It also specifies the paths on the local file system of a DataNode where it stores its blocks.

hdfs-site.xml [5]

<property>
<name>dfs.replication</name>
<value>4</value>
<description>Default block replication. The actual number of replications can be specified when the file is created.</description>
</property>
Configuring MapReduce: In conf/mapred-site.xml: Parameters related to the JobTracker and TaskTrackers are specified here: the host (or IP) and port of the JobTracker, and the path on HDFS where the MapReduce framework stores the data files given as input to various jobs. If a scheduler other than the default is used, that is also specified here.

mapred-site.xml [5]

<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce JobTracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

Configuring the Capacity Scheduler: The capacity scheduler was built from source by executing ant package. The resulting capacity-scheduler jar file was placed in the hadoop/build/contrib/ folder. The Hadoop master node was then configured to use the capacity scheduler instead of the default scheduler.

mapred-site.xml [5]

<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
<description>The scheduler to be used by the JobTracker</description>
</property>

HADOOP_CLASSPATH [5] in conf/hadoop-env.sh was modified to include the capacity-scheduler jar. The file mapred-site.xml was then modified to include the new property, mapred.jobtracker.taskScheduler, with the value org.apache.hadoop.mapred.CapacityTaskScheduler.

3.2 VM Installation and Configuration

Oracle VM VirtualBox [3], a software package for creating and maintaining VMs, was used for creating the VMs. Oracle VM VirtualBox was installed as an application on an existing host operating system. This host application allows additional guest operating systems, each known as a Guest OS, to be loaded and run, each with its own virtual environment. First, a virtual hard disk image was created, with 512 MB base memory and a maximum of 8 GB storage. For more flexible storage management, a dynamically allocated virtual disk file was created, which only uses space on the physical hard disk as it fills up. Though it will not shrink again automatically when space is freed, it has much better performance than a statically allocated hard disk. Initially the disk is very small and does not occupy any space for unused virtual disk sectors. The disk then grows every time a disk sector is written to for the first time, until the drive reaches the maximum capacity chosen. A VM was created with the above specification, and Ubuntu 11.04 was installed as the guest operating system.

VM Configurations

Network Settings for VM: To establish a connection between the VM and the host machine, the network settings were changed. The bridge-utils, vtun and uml-utilities packages [6] were installed on the host machine. A bridge and a tap device were created on the host machine for the guest to use. Two network adapters were enabled in the VM: along with the default Network Address Translation (NAT) adapter [6], a bridged adapter was configured to use the tap device created on the host. With these settings, an SSH connection was successfully established between the host and the VM. The process of creating the bridge and the tap device on the host machine was automated: a script was scheduled to run at boot time.
Thus, at system startup, the bridge and tap device are created automatically. To extend this to multiple machines, multiple tap devices can be created, with each VM attached to one device.
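The boot-time script described above can be sketched roughly as follows. The interface names, IP address, and user name are placeholders, and since the real commands (from the bridge-utils and uml-utilities packages) require root privileges on a physical host, this sketch only prints the commands it would run; dropping the echo prefix turns it into the actual script.

```shell
#!/bin/sh
# Sketch of the boot-time bridge/tap setup (placeholder names).
# Each command is echoed instead of executed; remove 'echo' and run
# as root on the host to perform the real setup.
TAP=tap0                 # tap device for one VM
BRIDGE=br0               # bridge the VM traffic goes through
HUSER=hduser             # dedicated Hadoop user (an assumption)

echo "tunctl -t $TAP -u $HUSER"          # create the tap device
echo "ifconfig $TAP up"                  # bring the tap up
echo "brctl addbr $BRIDGE"               # create the bridge
echo "brctl addif $BRIDGE $TAP"          # attach the tap to the bridge
echo "ifconfig $BRIDGE 192.168.56.1 up"  # example host-side address
```

For multiple VMs on one host, the tunctl/addif pair is repeated with tap1, tap2, and so on, and each VM's bridged adapter is pointed at its own tap device.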
Hadoop On VM: Sun Java was installed on the VM, after adding the necessary repository. SSH was then installed and configured on the VM. The Hadoop source folder was copied to the VM using the scp (secure copy) command [5]. First, a single-node cluster was run on the VM. After ensuring the correct working of Hadoop on the VM, it was configured as a slave node in the cluster.

Cloning VM: A virtual image of the Hadoop-configured machine was obtained by cloning the VM [3]. The cloned VM is an exact replica of the machine from which it is cloned. This made the whole process of creating a new Hadoop-configured machine much faster. As the machine is cloned, the new machine will also have the same hostname and IP address as the first one. In order to distinguish between the nodes in the cluster, the machine names and IP addresses should be distinct. This was fixed by editing the name of the new machine in /etc/hosts and /etc/hostname. A new IP address was assigned to the bridged adapter interface by setting the network parameters in /etc/network/interfaces. Thus, a fully configured slave node was created to add to the Hadoop cluster [7].

3.3 Setting Up Virtual Hadoop Cluster

Virtual clusters of varying size were set up. One of the physical machines was configured to be the master node of the cluster, and VMs were configured to be the slave nodes. The master node was placed on a physical machine so that the NameNode and JobTracker daemons run on a machine of higher capacity, ensuring better performance for the jobs run on the cluster. In order to ensure scalability of the cluster, it was extended to multiple VMs. Different network configurations were made to establish communication with VM nodes on different physical systems. In addition to the default NAT adapter, a bridged adapter was also configured.
The bridged adapter was attached to the wireless interface instead of the internal tap device. Then SSH connections were established between VMs on different systems, as well as with the master node. After establishing connections between all the nodes of the cluster, MapReduce programs were run. The execution of MapReduce tasks was repeated for different input sizes. Clusters of different size were also configured by repeating the above steps. The time taken for each MapReduce job was analyzed. The performance of the virtual cluster was enhanced by making use of the capacity scheduler, which takes the memory available on each node into account. The master node was configured to use the capacity scheduler, and the whole virtual cluster ran MapReduce with capacity scheduling.

Running MapReduce Jobs: The following procedure was used for running MapReduce [1] tasks. $HADOOP_PATH was set to /usr/local/hadoop.

Formatting the NameNode: The NameNode was formatted prior to starting up the cluster, in case the HDFS size changes or new data comes in.

Command: $HADOOP_PATH/bin/hadoop namenode -format

Starting the Hadoop daemons: The Hadoop daemons JobTracker, TaskTracker, NameNode, DataNode and Secondary NameNode were started up.

Command: $HADOOP_PATH/bin/start-all.sh

The daemons can also be started individually on each node.

Command: $HADOOP_PATH/bin/hadoop-daemon.sh start <daemon-name>

Copying data files: The files on which the MapReduce job is executed are first copied into HDFS.

Command: $HADOOP_PATH/bin/hadoop dfs -copyFromLocal <local-path> <hdfs-location>

Running MapReduce: Once the data files are in place, the MapReduce job is run.

Command: $HADOOP_PATH/bin/hadoop <program-name> <input-path> <output-path>
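The steps above can be combined into a single runner script. A sketch follows; the paths, the example jar, and the program name (wordcount) are assumptions, and the commands are echoed rather than executed so the sketch can be inspected without a live cluster:

```shell
#!/bin/sh
# Sketch of a MapReduce job-runner script (dry run: commands are
# echoed; drop 'echo' on a real cluster).
HADOOP_PATH=/usr/local/hadoop
LOCAL_INPUT=/tmp/input            # example local data directory
HDFS_INPUT=input                  # example HDFS paths
HDFS_OUTPUT=output

echo "$HADOOP_PATH/bin/hadoop namenode -format"   # only for a fresh HDFS
echo "$HADOOP_PATH/bin/start-all.sh"              # start all five daemons
echo "$HADOOP_PATH/bin/hadoop dfs -copyFromLocal $LOCAL_INPUT $HDFS_INPUT"
echo "$HADOOP_PATH/bin/hadoop jar hadoop-0.20.2-examples.jar wordcount $HDFS_INPUT $HDFS_OUTPUT"
```

Scheduling such a script (as described next) removes all manual steps from a timing run.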
Automated Running Through Scripts: Scripts were created to run Hadoop MapReduce, to clone a VM to obtain a Hadoop-configured VM image, and to start up a VM and set up its network configuration. The process of Hadoop startup in the VM was also automated: a script for starting the Hadoop daemons was scheduled to run at boot time of the VM, so the daemons start automatically when the VM is started.

4 Performance Metrics

Virtual clusters of various sizes were set up, and performance was analyzed for varying input sizes by running MapReduce tasks. The variation in execution time was also observed by varying the cluster size and configuration parameters. The results of the experiments are presented here, along with the experimental data and the graphs obtained.

Comparison of Real and Virtual Clusters: With the number of physical machines remaining the same, real and virtual Hadoop clusters were set up. The real cluster consisted of only the physical machines as nodes. The virtual cluster was set up with VMs in each system along with the physical machine nodes. Thus the virtual cluster had more nodes than the real cluster, even though both had the same number of physical machines. The performance of both clusters was analyzed by running MapReduce jobs with inputs of varying size. The experimental data are listed in Table 1 as the input file size in MB, the time taken for execution of the job in the real cluster, and the time taken in the virtual cluster.
Input Size (MB)   Real Cluster (sec)   Virtual Cluster (sec)
3.1               34                   14
6.0               37                   26
11.9              36                   30
23.8              45                   41
47.6              52                   48
59.6              50                   33
71.0              60                   50

Table 1: Real vs Virtual Cluster Runtime

The graph in Figure 2 shows the comparison between the performance of a real cluster and a virtualized cluster on a single physical machine. The experiment was repeated for the different input sizes listed in Table 1. The graph shows that better performance was achieved with the virtual Hadoop cluster, indicating that better resource utilization is achieved with virtualization.

Figure 2: Graph for Real Vs Virtual Cluster Runtime
Analysis of execution time with respect to input size: The variation of time taken for MapReduce tasks was analyzed for different input sizes on a virtual Hadoop cluster. MapReduce jobs were executed for the input sizes in Table 2, which also lists the execution time corresponding to each input size.

Size of input (MB)   Time taken (sec)
3.1                  14
6.0                  26
11.9                 30
23.8                 41
47.6                 48
59.6                 63
71.0                 57

Table 2: Input Size vs Time Taken

The graph in Figure 3 depicts the performance of a virtual Hadoop cluster of three nodes. The variation of execution time for different input sizes is plotted.

Figure 3: Graph for Input Size vs Time Taken
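For repeating such timing runs, a text input of roughly a chosen size can be generated locally before copying it into HDFS. A minimal sketch follows; the file name and sample line are arbitrary choices, not part of the original experiment:

```shell
#!/bin/sh
# Generate a plain-text file of at least SIZE_MB megabytes by
# repeating a sample line; usable as wordcount input for timing runs.
SIZE_MB=1
TARGET=$((SIZE_MB * 1024 * 1024))   # target size in bytes
LINE="the quick brown fox jumps over the lazy dog"

: > /tmp/mr-input.txt               # truncate/create the file
while [ "$(wc -c < /tmp/mr-input.txt)" -lt "$TARGET" ]; do
  # append lines in batches to keep the loop fast
  for i in $(seq 1 5000); do echo "$LINE"; done >> /tmp/mr-input.txt
done
```

The resulting file can then be placed in HDFS with the copyFromLocal command shown in Section 3.3 before starting the job.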
Analysis of Execution Time with respect to Cluster Size: The performance of the virtual cluster was analyzed by varying the cluster size itself. Virtual Hadoop clusters of varying size were set up using multiple machines. In each machine, Hadoop-configured VMs were started up; depending on the system capacity, a varying number of VMs were started in each machine. MapReduce programs were run for the same input size of 6 MB. The experimental data, consisting of the number of nodes in the cluster and the time taken for processing a 6 MB input, is listed in Table 3.

Number of Nodes   Time taken (sec)
2                 43
3                 42
4                 30
7                 35

Table 3: Cluster Size vs Time Taken

The variation of execution time against cluster size is plotted in the graph in Figure 4.

Figure 4: Graph for Cluster Size vs Time Taken
Analysis of performance by varying the number of Reduce tasks: For different input sizes, the number of reduce tasks was varied and the MapReduce job executed. The number of reduce tasks is a configurable parameter of Hadoop. The performance was analyzed for the input sizes 3.1, 6, 11.9, 23.8, 47.6, 59.6 and 71 MB, with the number of reduce tasks set to 1, 2, 3, 4, 6, 8 and 16. This parameter can be configured by adding the property mapred.reduce.tasks in the mapred-site.xml file, with the desired number of reduce tasks given as its value. For each input size, the different reduce-task values listed above were configured, a MapReduce job was run, and the time taken for each job was analyzed. In total, all the combinations were tried out in the virtual Hadoop cluster. The results obtained are represented in the graph in Figure 5, which shows the variation of time with respect to two parameters, viz., the input size and the number of reduce tasks.

Figure 5: Graph for analysis of MapReduce jobs
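The property used for these runs can be set in mapred-site.xml in the same style as the earlier snippets; for example, for the four-reducer configuration (the description text is ours):

```xml
<property>
<name>mapred.reduce.tasks</name>
<value>4</value>
<description>The default number of reduce tasks per job.</description>
</property>
```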
Scalability of Virtual Hadoop Cluster: The results from the experiment showed that adding more nodes to the cluster decreases the time required to run the wordcount program, but after some point it shows diminishing returns. This suggests that at some point adding more nodes to the cluster will not improve the runtime, so the number of virtual nodes on each machine should be decided according to the system capacity. In the context of what Hadoop was designed for, the clusters and data sets used in the experiment are both considered small. Though Hadoop is meant to handle much larger data sets running on clusters with many more nodes, the experiments on the virtual cluster were conducted on machines of relatively small capacity. Given the relatively low power of the machines used, the results are still fairly relevant. According to the graph in Figure 4, using 7 nodes to run the wordcount program reduced the runtime by about 19 percent compared to using 2 nodes (35 s vs. 43 s in Table 3). Assuming that this trend holds for other MapReduce programs, improvements on the same scale can be achieved by setting up virtual clusters rather than running Hadoop jobs entirely on physical machines. The addition of more machines to the cluster leads to an even greater reduction in runtime. The virtual cluster can be scaled up according to the resources available.

Performance Parameters for Real Time Clusters: Various Hadoop configuration parameters [8] that directly affect MapReduce job performance under various conditions were analyzed in order to obtain better performance. The following parameters were analyzed to find the most suitable values for the virtual cluster:

dfs.block.size: The input data is split into different blocks before processing. dfs.block.size determines the size of the chunks into which the data is split. Increasing the block size will reduce the number of map tasks.

Temporary space: If jobs are large, space will be required to store the map output during execution.
By default, any intermediate data will be stored in temporary space, so increasing the temporary space will be advantageous for large jobs.

mapred.local.dir: Any temporary MapReduce data is stored here. More space is advantageous for jobs with large chunks of data.
mapred.map.tasks: The number of map tasks executed for the job. In a cluster, the number of DFS blocks usually determines the number of maps. For example, if dfs.block.size = 256 MB, then for an input size of 160 GB, the minimum number of maps = (160 * 1024) / 256 = 640 maps [8]. Best performance is achieved when the number of map tasks is set to a value approximately equal to, or a multiple of, the number of map task slots in the cluster. Network traffic is minimized when tasks are sent to a slot on a machine holding a local copy of the data; setting the number of map tasks as a multiple of the number of nodes encourages this and hence results in faster execution. As a rule of thumb, the number of map tasks can be set to 10 * the number of slaves (i.e., the number of TaskTrackers) [1].

mapred.reduce.tasks: The number of reduce tasks for the job. After the data is sorted, the reduce output is written to the local file system. The write process requests a set of DataNodes that will be used to store the block; if the local host is a DataNode in the file system, it will be the first DataNode in the returned set. Such a write to a DataNode on localhost is much faster, as it does not require bulk network traffic. Setting the number of reduce tasks to 2 * the number of slaves (i.e., the number of TaskTrackers) [1] will reduce the network traffic and hence result in better performance.

5 Conclusion

In this project, virtual Hadoop clusters were configured and set up. The performance was analyzed and the scalability issues were studied. As the cluster size increased, the runtime generally continued to decrease. Running multiple VMs put a considerable load on the host computer running the virtualization software. The decrease in time obtained by adding more VMs indicates that virtualization helped in better utilization of the resources of the host computer.
Future Work

The most recent version of Hadoop, version 0.23.1, released on 27 February 2012, has improved significantly in the areas of HDFS and MapReduce. It also addresses the issue of having multiple masters in a single cluster. Hence, the scalability issue can be dealt with in a better manner in the new version. It might be possible to have a cluster size ranging in the thousands with
the new version. Setting up a virtual cluster with the latest Hadoop version can bring much better results. More configuration parameters can be analyzed, and the performance of the virtual clusters can be increased by fine-tuning the values of the relevant parameters.

References

[1] Hadoop, The Apache Software Foundation, Dec. 2011. [Online]. Available: http://hadoop.apache.org/

[2] R.-E. F. (rafan), Hadoop Capacity Scheduler, Hadoop Taiwan User Group meeting, Yahoo! Search Engineering, 2009.

[3] Oracle VM VirtualBox User Manual [Accessed: January 19, 2012]. [Online]. Available: http://www.virtualbox.org/manual/

[4] J. Buell, A Benchmarking Case Study of Virtualized Hadoop Performance on VMware vSphere 5, Technical White Paper, VMware, Oct. 2011. [Online]. Available: http://www.vmware.com/files/pdf/techpaper/vmw-hadoop-performance-vsphere5.pdf

[5] M. G. Noll, Running Hadoop on Ubuntu Linux (Multi-Node Cluster), Aug. 2007, My digital moleskine. [Online]. Available: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

[6] C. Macdonald, VirtualBox host to guest networking, Oct. 2009, Callum on life. [Online]. Available: http://www.callum-macdonald.com/2009/10/28/virtualbox-host-to-guest-networking/

[7] Ravindra, Building a Hadoop Cluster using VirtualBox, Xebia IT Architects India Private Limited, Oct. 2010. [Online]. Available: http://xebee.xebia.in/2010/10/21/building-a-hadoop-cluster-using-virtualbox/

[8] Impetus, Hadoop Performance Tuning, White Paper, Impetus Technologies Inc., Oct. 2009. [Online]. Available: www.impetus.com

[9] D. de Nadal Bou, Support for managing dynamically Hadoop clusters, Master in Information Technology - MTI, Sep. 2010, Project Director: Yolanda Becerra. [Online]. Available: http://hdl.handle.net/2099.1/9920
[10] Gliffy, Online Diagram Software. [Online]. Available: http://www.gliffy.com/

[11] J. Devine, Evaluating the Scalability of Hadoop in a Real and Virtual Environment, Dec. 2008, CS380 Final Project. [Online]. Available: jamesdevine.info/wp-content/uploads/2009/03/project.pdf

[12] The Hadooper in me, Research, Labs & IT stuff, Buenos Aires, Nov. 2010. [Online]. Available: http://hadooper.blogspot.com/