Integration Of Virtualization With Hadoop Tools
Aparna Raj K (aparnaraj.k@iiitb.org), Kamaldeep Kaur (Kamaldeep.Kaur@iiitb.org), Uddipan Dutta (Uddipan.Dutta@iiitb.org), V Venkat Sandeep (Sandeep.VV@iiitb.org)

Technical Report IIITB-TR, April 2012
Abstract

Processing large amounts of data in very little time is a major requirement of the present-day world. The Hadoop Distributed File System (HDFS) [1] and MapReduce [1] are two major components contributing to this task. Even though MapReduce can process large amounts of data, the focus is always on obtaining maximum performance with optimal resources. The main goal of the project is to integrate the Hadoop MapReduce framework with virtualization to achieve better resource utilization. We present a virtualized setup of a Hadoop cluster that provides greater computing capacity with fewer resources, since a virtualized cluster requires fewer physical machines. The master node of the cluster is set up on a physical machine, and slave nodes are set up on virtual machines (VMs) that may share a common physical machine. Hadoop-configured VM images are created by cloning VMs, which facilitates fast addition and deletion of nodes in the cluster without much overhead. We have also configured the Hadoop virtual cluster to use the capacity scheduler instead of the default FIFO scheduler. The capacity scheduler checks the availability of RAM and virtual memory (VMEM) in the slave nodes before allocating any job, so instead of queuing up, jobs are efficiently allocated to VMs based on the memory available. Various configuration parameters of Hadoop are analyzed, and the virtual cluster is fine-tuned to ensure best performance and maximum scalability. Project source can be downloaded from the following URL:

© 2012 Aparna Raj Koroth, Kamaldeep Kaur Randhawa, Uddipan Dutta, Velugoti Venkat Sandeep. This material is available under the Creative Commons Attribution-Noncommercial-Share Alike License. See for details.
Acknowledgement

We are very thankful to the people who have helped and supported us during the project. We express our deepest thanks to our advisor, Prof. Shrisha Rao, for his encouraging words and useful suggestions throughout this work. It was a pleasure for us to work under his guidance. We would also like to thank Vinaya MS, an alumna of our college, for her valuable suggestions.
Contents

1 Introduction
2 System Design
  2.1 Architecture
  2.2 Virtualization
3 Implementation Details
  3.1 Hadoop Configuration
  3.2 VM Installation and Configuration
  3.3 Setting Up Virtual Hadoop Cluster
4 Performance Metrics
5 Conclusion
References
List of Tables

1 Real vs Virtual Cluster Runtime
2 Input Size vs Time Taken
3 Cluster Size vs Time Taken

List of Figures

1 Modified Architecture of Virtual Hadoop Cluster
2 Graph for Real vs Virtual Cluster Runtime
3 Graph for Input Size vs Time Taken
4 Graph for Cluster Size vs Time Taken
5 Graph for analysis of MapReduce jobs
1 Introduction

With significant advances in information technology, the amount of data being processed in the computing world is increasing day by day. As a result, a new kind of infrastructure and system that can handle large amounts of data is in high demand. Google designed a programming model for cloud computing, called MapReduce [1], as a possible solution for managing large data sets. Later Hadoop, from the Apache project [1], provided a popular open-source implementation of MapReduce. MapReduce is a software framework for processing huge volumes of data on big clusters of compute nodes, which process the data in parallel. As the amount of data processed is huge, a lot of resources have to be quickly coordinated and allocated. Though it is true that Hadoop gives good results on physical machines, service providers find it difficult to manage incoming requests quickly. Virtualization can solve such management problems for Hadoop, and this project aims at enhancing the Hadoop environment with quick resource allocation capabilities.

Objective

The major goal of the project is the integration of Hadoop tools with virtualization using the Hadoop MapReduce framework. The project aims at building a virtualized environment for Hadoop such that maximum performance is obtained with optimal resources. It also focuses on understanding the most suitable configuration of Hadoop for best performance in a virtualized environment.

Description

Hadoop MapReduce computes large amounts of data in a reasonable time, due to its high scalability. This is achieved by dividing a job into small tasks that can be spread across a large collection of computers. As part of the MapReduce implementation, Hadoop has an internal scheduler for managing these incoming requests. The capabilities of the scheduler are enhanced by configuring a new scheduler, called the capacity scheduler [2].
This scheduler is better suited than the default Hadoop scheduler for memory management in a virtualized Hadoop cluster. In the event of an unexpectedly high amount of data, more VMs can be added to the cluster via cloning or through normal creation of a VM.
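Cloning can be scripted with VirtualBox's VBoxManage command-line tool. A minimal sketch follows; the VM names and the count are assumptions for illustration, and the script only prints the commands so they can be reviewed before being piped to sh:

```shell
#!/bin/sh
# Sketch: emit the VBoxManage commands that would clone one
# pre-configured base VM into N slave images.
# "hadoop-base" and "hadoop-slave<i>" are assumed names.
BASE_VM="hadoop-base"
N=3
i=1
while [ "$i" -le "$N" ]; do
  echo "VBoxManage clonevm ${BASE_VM} --name hadoop-slave${i} --register"
  i=$((i + 1))
done
```

Dropping the echo would execute the clones directly; each clone still needs a distinct hostname and IP address, as discussed in Section 3.2.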
Gap Analysis

Though Hadoop MapReduce works well on physical systems, quick provisioning of resources is a problem. Setting up and configuring Hadoop on a new physical machine and adding it as a node to an existing Hadoop cluster is time consuming. But with virtualization, a new node configured with Hadoop can easily be created by cloning an already configured VM.

i. As the computations involved are large, a lot of resources must be quickly coordinated and properly allocated. If there are a large number of requests at a particular time, and the Hadoop environment requires more resources to process them, adding and configuring a new physical machine would take a considerable amount of time.

ii. For a large job, adding more TaskTrackers [1] to the cluster helps speed up computation, but there is no flexibility in adding or removing nodes from a Hadoop cluster set up entirely on physical machines.

iii. The JobTracker [1], which acts as the master node in the cluster, is a single point of failure.

The project aims at enhancing the Hadoop environment with more virtualized infrastructure so that the Hadoop cluster can easily be extended to a larger cluster. If more nodes are required to finish the jobs in a cluster, a VM can be cloned to produce a new node. This node is created from a predefined machine image with the required Hadoop configuration and software. As it is already configured according to the requirements, it can be added to the Hadoop cluster easily. This solves the problem of quick resource provisioning on Hadoop. With multiple VMs running, the overall utilization of the clusters is improved.

2 System Design

The basic components involved are:

i. The MapReduce framework.

ii. HDFS [1], which provides the data storage required by the MapReduce framework.

iii. Oracle VM VirtualBox [3], which manages creation, cloning and maintenance of VMs.
2.1 Architecture

To analyze the performance of Hadoop clusters with virtualization, clusters were designed with virtual nodes. A virtual Hadoop cluster was set up with the master node on a physical machine and slave nodes on VMs. The architecture of the system is depicted in Figure 1.

Figure 1: Modified Architecture of Virtual Hadoop Cluster

Figure 1 shows a virtual Hadoop cluster on a single host machine. The master node of the cluster is the host machine, i.e., the physical machine. The Hadoop daemons NameNode and JobTracker run on the master node. The other nodes required to form the cluster are set up on VMs: multiple VMs act as slave nodes, on which the Hadoop daemons DataNode and TaskTracker [1] run. A new slave node can easily be added by cloning an already configured VM.

MapReduce Framework

MapReduce is a programming model that processes large amounts of data, such as web logs, crawled documents, etc. Though the computations involved in MapReduce are more or less straightforward, the input can be very large; if it is distributed across a large number of machines, big computations can be finished within a reasonable amount of time. The nodes in the cluster are categorized as:

i. Master node or JobTracker.
ii. Slave node or TaskTracker.

The master node, or JobTracker, receives input, splits the job into smaller tasks, and assigns each split to slave nodes, or TaskTrackers, for computation.

HDFS - The data storage for MapReduce

Hadoop has a distributed file system within it, the HDFS. Computation and data storage are done on the same machines by making use of this file system, so the use of a separate dedicated storage system is avoided. HDFS follows the same master-slave architecture as MapReduce: the master node is called the NameNode [1] and the slave nodes are termed DataNodes [1]. The NameNode stores and coordinates all the meta-information of the file system. Each node in the cluster is a DataNode.

Capacity Scheduler

The capacity scheduler [2] is a new scheduler for Hadoop. By using this scheduler, it is possible to reduce the execution time of jobs. This is achieved by taking the memory capacity of the nodes into consideration while scheduling the jobs. It also helps in achieving improved throughput, and the utilization of the cluster can be increased. Jobs are submitted to the multiple queues of the capacity scheduler, which balances the jobs allocated and the fraction of capacity allotted to each queue so as to have a uniform allocation. The queues support job priorities as well. Each queue enforces an upper limit on the percentage of resources allocated per user, which prevents a few jobs from dominating the resources. Memory-intensive jobs are well supported by this scheduler: jobs can specify higher memory requirements if necessary, and such jobs will run only on TaskTrackers with more spare memory.

2.2 Virtualization

Though Hadoop works on physical machines and provides good results, it is difficult to quickly add a new node to the cluster. Virtualization [4] can solve such problems for Hadoop, as it does for other software solutions. By virtualizing the nodes in the Hadoop cluster, better resource utilization can be ensured.
It also facilitates quick addition and deletion of nodes without much overhead. Partitioning each node into a number of virtual machines (VMs) gives us a number of benefits:
i. Scheduling: With VMs, hardware utilization can be increased. When more computing capacity is required for scheduling batch jobs, the unused capacity of the hardware can be exploited with the help of virtualization.

ii. Resource Utilization: Different kinds of VMs can be hosted on the same physical machine. Hadoop VMs can be used along with VMs for other tasks. This results in better resource utilization by consolidating different kinds of applications.

iii. Datacenter Efficiency: More types of tasks can be run on a virtualized infrastructure, as it is possible to run cross-platform applications as well.

iv. Deployment: Deployment time for new nodes in a Hadoop cluster can be greatly reduced by virtualization. Configuring Hadoop on a machine can be done quickly by cloning an already configured VM.

By creating a Hadoop cluster of VMs, better resource utilization can be achieved. The Hadoop environment is modified to use the capacity scheduler. Its ability to schedule tasks based on a job's memory requirements, in terms of RAM and virtual memory on a slave node, makes it better suited to a virtual Hadoop cluster.

3 Implementation Details

Setting up a virtual Hadoop cluster involves many tasks. They are listed in this section, with the required configurations and settings. After configuring Hadoop on VMs, the virtual Hadoop cluster is set up with a physical machine as the master node and VMs as the slave nodes. The scheduling of jobs in the virtual cluster is enhanced by configuring the capacity scheduler on the master node.

3.1 Hadoop Configuration

Hadoop was installed and a cluster of multiple nodes was set up. The prerequisites [5] are:

Java 1.6, with the JAVA_HOME environment variable set to the Java installation location
Secure Shell (SSH) server and client, with an SSH key generated for each machine in the cluster

Adding a dedicated Hadoop system user: A dedicated user account [5] was set up for running Hadoop. This helps to separate the Hadoop installation from other software applications and user accounts running on the same machine, and also provides better security.

SSH Access: Hadoop requires SSH access to manage its nodes, as the nodes are on different systems or VMs. For our multi-node setup of Hadoop, SSH access was set up between the dedicated user accounts created on the machines. It was then configured to allow SSH public-key authentication. It is necessary to generate an SSH key [5] for each machine, since the scripts that start Hadoop need to be able to log into the other machines in the cluster without entering a password. This can be done with the following commands:

ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Typically, one machine in the cluster is designated exclusively as the NameNode and another machine as the JobTracker. These are the master nodes of the cluster. The other machines in the cluster act as both DataNode and TaskTracker; these are the slave nodes. The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster have the same HADOOP_HOME path. Hadoop configuration is driven by the following site-specific configuration files [5]:

conf/core-site.xml
conf/hdfs-site.xml
conf/mapred-site.xml

Site Configuration: The Hadoop daemons, NameNode/DataNode and JobTracker/TaskTracker, are configured to run MapReduce jobs on the cluster.
In conf/core-site.xml, the Uniform Resource Identifier (URI) of the NameNode is specified.

core-site.xml [5]

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system.</description>
</property>

Configuring HDFS: In conf/hdfs-site.xml, the path on the local file system where the NameNode persistently stores the namespace and transaction logs is specified. It also specifies the paths on the local file system of a DataNode where it stores its blocks.

hdfs-site.xml [5]

<property>
  <name>dfs.replication</name>
  <value>4</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created.</description>
</property>
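One practical consequence of the dfs.replication value above: every block is stored that many times, so raw disk consumption is a multiple of the logical file size. A quick sanity check in shell (the file size is an arbitrary example, not a figure from the report):

```shell
# With dfs.replication = 4, a file consumes 4x its size in raw HDFS storage.
REPLICATION=4
FILE_MB=100   # hypothetical input file size
RAW_MB=$(( FILE_MB * REPLICATION ))
echo "${FILE_MB} MB file occupies ${RAW_MB} MB of raw HDFS storage"
```

This is worth keeping in mind when sizing the virtual disks of the slave VMs.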
Configuring MapReduce: In conf/mapred-site.xml, parameters related to the JobTracker and TaskTrackers are specified: the host (or IP) and port of the JobTracker, and the path on HDFS where the MapReduce framework stores the data files that serve as input to the various jobs. If a scheduler other than the default is used, that is also specified here.

mapred-site.xml [5]

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce JobTracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

Configuring the Capacity Scheduler: The capacity scheduler was built from source by executing ant package. The resulting capacity-scheduler jar file was placed in the hadoop/build/contrib/ folder. The Hadoop master node was then configured to use the capacity scheduler instead of the default scheduler.

mapred-site.xml [5]

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
  <description>The scheduler to be used by the JobTracker.</description>
</property>

HADOOP_CLASSPATH [5] in conf/hadoop-env.sh was modified to include the capacity-scheduler jar. Then the file mapred-site.xml was
modified to include the new property mapred.jobtracker.taskScheduler, with the value org.apache.hadoop.mapred.CapacityTaskScheduler.

3.2 VM Installation and Configuration

Oracle VM VirtualBox [3], a software package for creating and maintaining VMs, was used for creating the VMs. Oracle VM VirtualBox was installed as an application on an existing host operating system. This host application allows additional guest operating systems, each known as a Guest OS, to be loaded and run, each with its own virtual environment. First, a virtual hard disk image was created, with 512 MB base memory and a maximum of 8 GB storage. For more flexible storage management, a dynamically allocated virtual disk file was created, which uses space on the physical hard disk only as it fills up. Though it will not shrink again automatically when space is freed, it has much better performance than a statically allocated hard disk. Initially the disk is very small and does not occupy any space for unused virtual disk sectors; it then grows every time a disk sector is written to for the first time, until the drive reaches the maximum capacity chosen. A VM was created with the above specification, and Ubuntu was installed as the guest operating system.

VM Configurations

Network Settings for VM: To establish a connection between the VM and the host machine, the network settings were changed. The bridge-utils, vtun and uml-utilities packages [6] were installed on the host machine. A bridge and a tap device were created on the host machine for the guest to use. Two network adapters were enabled in the VM: along with the default Network Address Translation (NAT) adapter [6], a bridged adapter was configured to use the tap device created on the host. With these settings, an SSH connection was successfully established between the host and the VM. The process of creating the bridge and the tap device on the host machine was automated: a script was scheduled to run at boot time.
Thus, at system startup, the bridge and tap device were created automatically. To extend this to multiple machines, multiple tap devices can be created, with each VM attached to one device.
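Each guest attached to its own tap device then needs a distinct static address on the bridged interface. A hedged sketch of the relevant /etc/network/interfaces stanza on a slave VM follows; the interface name and the addresses are assumptions for illustration, not values from the report:

```
auto eth1
iface eth1 inet static
    address 192.168.56.101
    netmask 255.255.255.0
```

Here eth1 stands for the bridged adapter; the NAT adapter keeps its default DHCP configuration.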
Hadoop on the VM: Sun Java was installed on the VM, after adding the necessary repository. SSH was then installed and configured on the VM. The Hadoop source folder was copied to the VM using the scp (secure copy) command [5], which copies data securely between hosts. First, a single-node cluster was run on the VM. After verifying the correct working of Hadoop on the VM, it was configured as a slave node in the cluster.

Cloning the VM: A virtual image of the Hadoop-configured machine was obtained by cloning the VM [3]. The cloned VM is an exact replica of the machine from which it is cloned. This made the whole process of creating a new Hadoop-configured machine much faster. Because the machine is cloned, the new machine initially has the same name and IP address as the first one. In order to distinguish between the nodes in the cluster, the machine names and IP addresses must be distinct. This was fixed by editing the name of the new machine in /etc/hosts and /etc/hostname. A new IP address was assigned to the bridged adapter interface by setting the network parameters in /etc/network/interfaces. Thus, a fully configured slave node was created and added to the Hadoop cluster [7].

3.3 Setting Up Virtual Hadoop Cluster

Virtual clusters of varying size were set up. One of the physical machines was configured as the master node of the cluster, and VMs were configured as the slave nodes. The master node was configured on a physical machine so that the NameNode and JobTracker daemons run on a machine with higher capacity, which ensures better performance for the jobs run on the cluster. To ensure scalability of the cluster, it was extended to multiple VMs. Different network configurations were used to establish communication with VM nodes on different physical systems. In addition to the default NAT adapter, a bridged adapter was also configured.
The bridged adapter was attached to the wireless interface instead of the internal tap device. SSH connections were then established between VMs on different systems, as well as with the master node. After establishing connections between all the nodes of the cluster, MapReduce programs were run. The execution of MapReduce tasks was repeated for different input sizes. Clusters of different sizes were also configured by
repeating the above steps. The time taken for each MapReduce job was analyzed. The performance of the virtual cluster was enhanced by making use of the capacity scheduler, which takes the memory available on each node into account. The master node was configured to use the capacity scheduler, and the whole virtual cluster ran MapReduce with capacity scheduling.

Running MapReduce Jobs: The following procedure was used for running MapReduce [1] tasks. $HADOOP_PATH was set to /usr/local/hadoop.

Formatting the NameNode: The NameNode is formatted prior to starting up the cluster, in case the HDFS size changes or new data comes in.

Command: $HADOOP_PATH/bin/hadoop namenode -format

Starting the Hadoop daemons: The Hadoop daemons JobTracker, TaskTracker, NameNode, DataNode and Secondary NameNode are started.

Command: $HADOOP_PATH/bin/start-all.sh

The daemons can also be started individually on each node.

Command: $HADOOP_PATH/bin/hadoop-daemon.sh start <daemon-name>

Copying data files: The files on which the MapReduce job is executed are first copied into HDFS.

Command: $HADOOP_PATH/bin/hadoop dfs -copyFromLocal <local-path> <hdfs-location>

Running MapReduce: Once the data files are in place, the MapReduce job is run.

Command: $HADOOP_PATH/bin/hadoop jar <jar-file> <program-name> <input-path> <output-path>
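The sequence above lends itself to a wrapper script. A minimal sketch: the run helper only prints each command so the sequence can be inspected first; removing the echo would execute it for real. The input and output paths and the job jar name are hypothetical:

```shell
#!/bin/sh
# Sketch of the job sequence described above (dry run: commands are
# printed, not executed). Paths and jar name are assumptions.
HADOOP_PATH=/usr/local/hadoop
run() { echo "$@"; }

run "$HADOOP_PATH/bin/hadoop" namenode -format
run "$HADOOP_PATH/bin/start-all.sh"
run "$HADOOP_PATH/bin/hadoop" dfs -copyFromLocal /tmp/input /user/input
run "$HADOOP_PATH/bin/hadoop" jar wordcount.jar /user/input /user/output
```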
Automated Running Through Scripts: Scripts were created to run Hadoop MapReduce, to clone a VM to obtain a Hadoop-configured VM image, and to start up a VM and set up its network configuration. The Hadoop startup process in the VM was also automated: a script that starts the Hadoop daemons was scheduled to run at boot time of the VM, so the Hadoop daemons start automatically when the VM starts.

4 Performance Metrics

Virtual clusters of various sizes were set up, and performance was analyzed for varying input sizes by running MapReduce tasks. The variation in execution time was also observed while varying the cluster size and configuration parameters. The results of the experiments are presented here, along with the experimental data and the graphs obtained.

Comparison of Real and Virtual Clusters: With the number of physical machines remaining the same, real and virtual Hadoop clusters were set up. The real cluster consisted of only the physical machines as nodes. The virtual cluster was set up with VMs on each system along with the physical machine nodes. Thus the virtual cluster had more nodes than the real cluster, even though both had the same number of physical machines. The performance of both clusters was analyzed by running MapReduce jobs with inputs of varying size. The experimental data are listed in Table 1: the input file size in MB, the time taken for execution of the job in the real cluster, and the time taken for execution in the virtual cluster.
Input Size (MB) | Real Cluster (sec) | Virtual Cluster (sec)

Table 1: Real vs Virtual Cluster Runtime

The graph in Figure 2 shows the comparison between the performance of a real cluster and a virtualized cluster on a single physical machine. The experiment was repeated for the different input sizes listed in Table 1. The graph shows that better performance was achieved with the virtual Hadoop cluster, which indicates that better resource utilization is achieved with virtualization.

Figure 2: Graph for Real vs Virtual Cluster Runtime
Analysis of execution time with respect to input size: The variation in time taken by MapReduce tasks was analyzed for different input sizes on a virtual Hadoop cluster. MapReduce jobs were executed for the input sizes in Table 2, and the execution time corresponding to each input size is also given there.

Size of input (MB) | Time taken (sec)

Table 2: Input Size vs Time Taken

The graph in Figure 3 depicts the performance of a virtual Hadoop cluster of three nodes. The variation of execution time for different input sizes is plotted.

Figure 3: Graph for Input Size vs Time Taken
Analysis of execution time with respect to cluster size: The performance of the virtual cluster was analyzed by varying the cluster size itself. Virtual Hadoop clusters of varying size were set up using multiple machines. On each machine, Hadoop-configured VMs were started; depending on the system capacity, a varying number of VMs were started on each machine. MapReduce programs were run for the same input size of 6 MB. The experimental data, consisting of the number of nodes in the cluster and the time taken for processing a 6 MB input on the cluster, is listed in Table 3.

Number of Nodes | Time taken (sec)

Table 3: Cluster Size vs Time Taken

The variation of execution time against cluster size is plotted in the graph in Figure 4.

Figure 4: Graph for Cluster Size vs Time Taken
Analysis of performance by varying the number of reduce tasks: For different input sizes, the number of reduce tasks was varied and the MapReduce job executed. The number of reduce tasks is a configurable Hadoop parameter. The performance was analyzed for input sizes of 3.1, 6, 11.9, 23.8, 47.6, 59.6 and 71 MB, with the number of reduce tasks set to 1, 2, 3, 4, 6, 8 and 16. This parameter is configured by adding the property mapred.reduce.tasks to the mapred-site.xml file, with the desired number of reduce tasks as its value. For each input size, each of the reduce-task values listed above was configured and the MapReduce job run, and the time taken for each job was analyzed; in total, 42 combinations were tried in a virtual Hadoop cluster. The results obtained are represented in the graph in Figure 5, which shows the variation of time with respect to two parameters: the input size and the number of reduce tasks.

Figure 5: Graph for analysis of MapReduce jobs
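The benchmark job used in these experiments is wordcount. Its map/shuffle/reduce structure can be mimicked outside Hadoop by an ordinary shell pipeline, which is a useful mental model of what each phase does (the sample input is arbitrary):

```shell
printf 'big data big cluster\n' |
  tr ' ' '\n' |   # map: emit one record per input word
  sort |          # shuffle: bring identical keys together
  uniq -c         # reduce: count the records in each group
```

In Hadoop, the same three phases run in parallel across the TaskTrackers, with the sort/shuffle handled by the framework between the map and reduce stages.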
Scalability of Virtual Hadoop Cluster: The results from the experiment showed that adding more nodes to the cluster decreases the time required to run the wordcount program, but after some point it shows diminishing returns. This suggests that at some point adding more nodes to the cluster will not improve the runtime, so the number of virtual nodes on each machine should be decided according to the system capacity. In the context of what Hadoop was designed for, the clusters and data sets used in the experiment are both small. Though Hadoop is meant to handle much larger data sets running on clusters with many more nodes, the experiments on the virtual cluster were conducted on machines of relatively small capacity. Given the relatively low power of the machines used in the real cluster, the results are still fairly relevant. According to the graph in scenario 2, using 7 nodes to run the wordcount program reduced the runtime by nearly 14 percent compared to using 2 nodes. Assuming that this trend holds for other MapReduce programs, improvements on the same scale can be achieved by setting up virtual clusters rather than running Hadoop jobs entirely on physical machines. The addition of more machines to the cluster leads to an even greater reduction in runtime. The virtual cluster can thus be scaled up according to the resources available.

Performance Parameters for Real Time Clusters: Various Hadoop configuration parameters [8] that directly affect MapReduce job performance under various conditions were analyzed in order to obtain better performance. The following parameters were analyzed to find the most suitable values for the virtual cluster:

dfs.block.size: The input data is split into blocks before processing; dfs.block.size determines the size of the chunks into which the data is split. Increasing the block size reduces the number of map tasks.

Temporary space: If jobs are large, space is required to store the map output during execution.
By default, any intermediate data is stored in temporary space, so increasing the temporary space is advantageous for large jobs.

mapred.local.dir: Any temporary MapReduce data is stored here. More space is advantageous for jobs with large chunks of data.
mapred.map.tasks: The number of map tasks executed for the job. In a cluster, the number of DFS blocks usually determines the number of maps. For example, if dfs.block.size = 256 MB, then for an input size of 160 GB, the minimum number of maps = (160*1024)/256 = 640 maps [8]. Best performance is achieved when the number of map tasks is set to a value approximately equal to, or a multiple of, the number of map task slots in the cluster. Network traffic is minimized when tasks are sent to a slot on a machine holding a local copy of the data; setting the number of map tasks to a multiple of the number of nodes encourages this and hence results in faster execution. As a rule of thumb, the number of map tasks can be set to 10 times the number of slaves (i.e., the number of TaskTrackers) [1].

mapred.reduce.tasks: The number of reduce tasks for the job. After the data is sorted, the reduce output is written to the local file system. The write process requests a set of DataNodes that will be used to store the block; if the local host is a DataNode in the file system, it will be the first DataNode in the returned set. Such a write to a DataNode on localhost is much faster, as it does not require bulk network traffic. Setting the number of reduce tasks to 2 times the number of slaves [1] (i.e., the number of TaskTrackers) reduces the network traffic and hence results in better performance.

5 Conclusion

In this project, virtual Hadoop clusters were configured and set up. The performance was analyzed and the scalability issues were studied. As the cluster size increased, the runtime continued to decrease. Running multiple VMs put a considerable load on the host computer running the virtualization software. The decrease in time obtained by adding more VMs indicates that the use of virtualization helped in better utilization of the resources of the host computer.
Future Work

The most recent version of Hadoop, released on 27th February 2012, has improved significantly in the areas of HDFS and MapReduce. It also addresses the issue of having multiple masters in a single cluster. Hence, the scalability issue can be dealt with in a better manner in the new version. It might be possible to have a cluster size in the thousands with
the new version. Setting up a virtual cluster with the latest Hadoop version could bring much better results. More configuration parameters can be analyzed, and the performance of the virtual clusters can be increased by fine-tuning the values of the relevant parameters.

References

[1] Hadoop, The Apache Software Foundation, Dec. [Online]. Available:
[2] R.-E. F. (rafan), "Hadoop Capacity Scheduler," Hadoop Taiwan User Group meeting, 2009, Yahoo! Search Engineering.
[3] Oracle VM VirtualBox, User Manual [Accessed: January 19, 2012]. [Online]. Available:
[4] J. Buell, "A Benchmarking Case Study of Virtualized Hadoop Performance on VMware vSphere 5," Technical White Paper, VMware, Oct. [Online]. Available: techpaper/vmw-hadoop-performance-vsphere5.pdf
[5] M. G. Noll, "Running Hadoop on Ubuntu Linux (Multi-Node Cluster)," Aug. 2007, My digital moleskine. [Online]. Available: running-hadoop-on-ubuntu-linux-multi-node-cluster/
[6] C. Macdonald, "VirtualBox host to guest networking," Oct. 2009, Callum on life. [Online]. Available: com/2009/10/28/virtualbox-host-to-guest-networking/
[7] Ravindra, "Building a Hadoop Cluster using VirtualBox," Xebia IT Architects India Private Limited, Oct. [Online]. Available: building-a-hadoop-cluster-using-virtualbox/
[8] Impetus, "Hadoop Performance Tuning," White Paper, Impetus Technologies Inc., Oct. 2009, Partners in Software R&D and Engineering. [Online]. Available:
[9] D. de Nadal Bou, "Support for managing dynamically Hadoop clusters," Master in Information Technology - MTI, Sep. 2010, Project Director: Yolanda Becerra. [Online]. Available:
[10] Gliffy, online diagram software. [Online]. Available: gliffy.com/
[11] J. Devine, "Evaluating the Scalability of Hadoop in a Real and Virtual Environment," CS380 Final Project, Dec. 2008. [Online]. Available: jamesdevine.info/wp-content/uploads/2009/03/project.pdf
[12] "the Hadooper in me," Research, Labs & IT stuff, Buenos Aires, Nov. [Online]. Available: