Integration Of Virtualization With Hadoop Tools

Aparna Raj K, Kamaldeep Kaur, Uddipan Dutta, V Venkat Sandeep
Technical Report IIITB-TR
April 2012

Abstract

Processing large amounts of data in very little time is a major requirement of the present day. The Hadoop Distributed File System (HDFS) [1] and MapReduce [1] are two major components contributing to this task. Even though MapReduce can process large amounts of data, the focus is always on obtaining maximum performance with optimal resources. The main goal of the project is to integrate the Hadoop MapReduce framework with virtualization to achieve better resource utilization. We present a virtualized setup of a Hadoop cluster that provides greater computing capacity with fewer resources, since a virtualized cluster requires fewer physical machines. The master node of the cluster is set up on a physical machine, and slave nodes are set up on virtual machines (VMs) that may share a common physical machine. Hadoop-configured VM images are created by cloning VMs, which allows nodes to be added to and removed from the cluster quickly and without much overhead. We have also configured the Hadoop virtual cluster to use the capacity scheduler instead of the default FIFO scheduler. The capacity scheduler checks the availability of RAM and virtual memory (VMEM) on the slave nodes before allocating any job, so jobs are allocated to VMs according to the memory available instead of simply being queued. Various configuration parameters of Hadoop are analyzed, and the virtual cluster is fine-tuned for best performance and maximum scalability.

Project source can be downloaded from the following URL:

© 2012 Aparna Raj Koroth, Kamaldeep Kaur Randhawa, Uddipan Dutta, Velugoti Venkat Sandeep. This material is available under the Creative Commons Attribution-Noncommercial-Share Alike License.

Acknowledgement

We are very thankful to the people who helped and supported us during the project. We express our deepest thanks to our advisor, Prof. Shrisha Rao, for his encouraging words and useful suggestions throughout this work. It was a pleasure to work under his guidance. We would also like to thank Vinaya MS, an alumna of our college, for her valuable suggestions.

Contents

1 Introduction
2 System Design
  2.1 Architecture
  2.2 Virtualization
3 Implementation Details
  3.1 Hadoop Configuration
  3.2 VM Installation and Configuration
  3.3 Setting Up Virtual Hadoop Cluster
4 Performance Metrics
5 Conclusion
References

List of Tables

1 Real vs Virtual Cluster Runtime
2 Input Size vs Time Taken
3 Cluster Size vs Time Taken

List of Figures

1 Modified Architecture of Virtual Hadoop Cluster
2 Graph for Real Vs Virtual Cluster Runtime
3 Graph for Input Size vs Time Taken
4 Graph for Cluster Size vs Time Taken
5 Graph for analysis of MapReduce jobs

1 Introduction

With significant advances in information technology, the amount of data being processed in the computing world is increasing day by day. As a result, infrastructure and systems that can handle large amounts of data are in high demand. Google designed a programming model for cloud computing, called MapReduce [1], as a possible solution for managing large data sets. Later, Hadoop, from the Apache project [1], provided a more popular open source implementation of MapReduce. MapReduce is a software framework for processing huge volumes of data in parallel on large clusters of compute nodes. As the amount of data processed is huge, a lot of resources have to be quickly coordinated and allocated. Although Hadoop gives good results on physical machines, service providers find it difficult to manage incoming requests quickly. Virtualization can solve such management problems for Hadoop, and this project aims at enhancing the Hadoop environment with quick resource allocation capabilities.

Objective

The major goal of the project is the integration of Hadoop tools with virtualization using the Hadoop MapReduce framework. The project aims at building a virtualized environment for Hadoop such that maximum performance is obtained with optimal resources. It also seeks to identify the most suitable configuration of Hadoop for best performance in a virtualized environment.

Description

Hadoop MapReduce processes large amounts of data in a reasonable time because it is highly scalable. This is achieved by dividing a job into small tasks that can be spread across a large collection of computers. As part of the MapReduce implementation, Hadoop has an internal scheduler for managing incoming requests. The capabilities of the scheduler are enhanced by configuring a different scheduler, called the capacity scheduler [2]. This scheduler is better suited to memory management in a virtualized Hadoop cluster than the default Hadoop scheduler. If an unexpectedly high amount of data arrives, more VMs can be added to the cluster by cloning or by creating a VM in the usual way.
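The division of a job into small map and reduce tasks can be illustrated with a word count, the program used later in the experiments. The following is only a single-process Python sketch of the idea, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for illustration, and a real Hadoop job would run the map and reduce steps as separate tasks on different nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_outputs):
    """Group the values emitted by all map tasks by their key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce task: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(splits):
    # One map task per input split; here they simply run one after another.
    mapped = [map_phase(split) for split in splits]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input, split into two pieces as a JobTracker might do.
    splits = ["hadoop runs on virtual machines", "virtual machines run hadoop"]
    print(word_count(splits))
```

Because each split is mapped independently, adding nodes lets more splits be processed at the same time, which is the property the virtual cluster is meant to exploit.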

Gap Analysis

Though Hadoop MapReduce works well on physical systems, quick provisioning of resources is a problem. Setting up and configuring Hadoop on a new physical machine and adding it as a node to an existing Hadoop cluster is time consuming. With virtualization, a new node configured with Hadoop can easily be created by cloning an already configured VM.

i. As the computations involved are large, a lot of resources should be quickly coordinated and properly allocated. If there are a large number of requests at a particular time, and the Hadoop environment requires more resources to process them, adding and configuring a new physical machine would take a considerable amount of time.
ii. For a large job, adding more TaskTrackers [1] to the cluster helps the computation finish faster, but a Hadoop cluster set up entirely on physical machines offers no flexibility in adding or removing nodes.
iii. The JobTracker [1], which acts as the master node in the cluster, is a single point of failure.

The project aims at enhancing the Hadoop environment with more virtualized infrastructure so that the Hadoop cluster can easily be extended to a larger cluster. If more nodes are required to finish the jobs in a cluster, a VM can be cloned to produce a new node. The new node is created from a predefined machine image with the required Hadoop configuration and software. As it is already configured according to the requirements, it can be added to the Hadoop cluster easily. This solves the problem of quick resource provisioning on Hadoop. With multiple VMs running, the overall utilization of the cluster is improved.

2 System Design

The basic components involved are:

i. MapReduce framework.
ii. HDFS [1], which provides the data storage required by the MapReduce framework.
iii. Oracle VM VirtualBox [3], which manages creation, cloning and maintenance of VMs.

2.1 Architecture

To analyze the performance of Hadoop clusters with virtualization, clusters were designed with virtual nodes. A virtual Hadoop cluster was set up with the master node on the physical machine and the slave nodes on VMs. The architecture of the system is depicted in Figure 1.

Figure 1: Modified Architecture of Virtual Hadoop Cluster

Figure 1 shows a virtual Hadoop cluster on a single host machine. The master node of the cluster is the host (physical) machine. The Hadoop daemons NameNode and JobTracker run on the master node. The other nodes required to form the cluster are set up on VMs. Multiple VMs are set up as slave nodes, on which the Hadoop daemons DataNode and TaskTracker [1] run. A new slave node can easily be added by cloning an already configured VM.

MapReduce Framework

MapReduce is a programming model for processing large amounts of data, such as web logs and crawled documents. Though the computations involved in MapReduce are more or less straightforward, the input is very large. If the work is distributed across a large number of machines, big computations can be finished within a reasonable amount of time. The nodes in the cluster are categorized as:

i. Master node or JobTracker.

ii. Slave node or TaskTracker.

The master node, or JobTracker, receives the input, splits the job into smaller tasks and assigns each split to the slave nodes, or TaskTrackers, for computation.

HDFS - The data storage for MapReduce

Hadoop includes its own distributed file system, HDFS. By making use of this file system, computation and data storage are done on the same machines, which avoids the need for separate dedicated storage. Like MapReduce, HDFS follows a master-slave architecture. The master node is called the NameNode [1] and the slave nodes are termed DataNodes [1]. The NameNode stores and coordinates all the meta-information of the file system. Each node in the cluster is a DataNode.

Capacity Scheduler

The capacity scheduler [2] is a new scheduler for Hadoop. By using this scheduler, it is possible to reduce the execution time of jobs. This is achieved by taking the memory capacity of the nodes into consideration while scheduling the jobs. It also helps in achieving improved throughput and can increase the utilization of the cluster. Jobs are submitted to the multiple queues of the capacity scheduler. The scheduler balances the jobs allocated against the fraction of capacity allotted to each queue so that allocation is uniform. The queues support job priorities as well. Each queue enforces an upper limit on the percentage of resources allocated per user. This prevents a few jobs from dominating the resources. Memory-intensive jobs are well supported by this scheduler. Jobs can specify higher memory requirements if necessary, and such jobs are run only on TaskTrackers with enough spare memory.

2.2 Virtualization

Though Hadoop works on physical machines and provides good results, it is difficult to quickly add a new node to the cluster. Virtualization [4] can solve such problems for Hadoop, as it does for other software solutions. By virtualizing the nodes in the Hadoop cluster, better resource utilization can be ensured. It also facilitates quick addition and deletion of nodes without much overhead. Partitioning each node into a number of virtual machines (VMs) gives a number of benefits:

i. Scheduling: With VMs, hardware utilization can be increased. When more computing capacity is required for scheduling batch jobs, the unused capacity of the hardware can be used with the help of virtualization.
ii. Resource Utilization: Different kinds of VMs can be hosted on the same physical machine. Hadoop VMs can be used along with VMs for other tasks. This results in better resource utilization by consolidating different kinds of applications.
iii. Datacenter Efficiency: A wider range of tasks can be run on a virtualized infrastructure, as it is possible to run cross-platform applications as well.
iv. Deployment: Deployment time for new nodes in a Hadoop cluster can be greatly reduced by virtualization. Configuring Hadoop on a machine can be done quickly by cloning an already configured VM.

By creating a Hadoop cluster of VMs, better resource utilization can be achieved. The Hadoop environment is modified to use the capacity scheduler. Its capability to schedule tasks based on a job's memory requirements, in terms of RAM and virtual memory on a slave node, makes it better suited to a virtual Hadoop cluster.

3 Implementation Details

Setting up a virtual Hadoop cluster involves many tasks. They are listed in this section, with the required configurations and settings. After configuring Hadoop on VMs, the virtual Hadoop cluster is set up with a physical machine as the master node and VMs as slave nodes. The scheduling of jobs in the virtual cluster is enhanced by configuring the capacity scheduler on the master node.

3.1 Hadoop Configuration

Hadoop was installed and a cluster of multiple nodes was set up. The pre-requisites [5] for this are listed below.

Pre-requisites:

Java 1.6, with the JAVA_HOME environment variable set to the Java installation location

Secure Shell (SSH) server and client, with an SSH key generated for each machine in the cluster

Adding a dedicated Hadoop system user: A dedicated user account [5] was set up for running Hadoop. This helps to separate the Hadoop installation from other software applications and user accounts running on the same machine. It also provides better security.

SSH Access: Hadoop requires SSH access to manage its nodes, as the nodes are on different systems or VMs. For our multi-node setup of Hadoop, SSH access was set up between the dedicated user accounts created on the machines and was then configured to allow SSH public key authentication. It is necessary to generate an SSH key [5] for each machine. The scripts that start Hadoop need to be able to log into the other machines in the cluster without entering a password. This can be done with the following commands:

    ssh-keygen -t rsa -P ""
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Typically one machine in the cluster is designated exclusively as the NameNode and another machine as the JobTracker. These are the master nodes of the cluster. The other machines in the cluster act as both DataNode and TaskTracker. These are the slave nodes. The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster have the same HADOOP_HOME path. Hadoop configuration is driven by the following site-specific configuration files [5]:

    conf/core-site.xml
    conf/hdfs-site.xml
    conf/mapred-site.xml

Site Configuration: The Hadoop daemons NameNode/DataNode and JobTracker/TaskTracker are configured to run MapReduce jobs on the cluster.

In conf/core-site.xml: The Uniform Resource Identifier (URI) of the NameNode is specified in this file.

core-site.xml [5]

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:54310</value>
      <description>The name of the default file system.</description>
    </property>

Configuring HDFS: In conf/hdfs-site.xml: The path on the local file system where the NameNode persistently stores the namespace and transaction logs is specified in this file. It also specifies the paths on the local file system of a DataNode where it stores its blocks.

hdfs-site.xml [5]

    <property>
      <name>dfs.replication</name>
      <value>4</value>
      <description>Default block replication. The actual number of replications can be specified when the file is created.</description>
    </property>
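When many cloned nodes must share the same settings, it can help to check property entries like these programmatically. The sketch below uses only the Python standard library to read name/value pairs and to split the NameNode URI into host and port. It assumes that the property elements sit inside the usual `<configuration>` root element of a Hadoop site file; the SAMPLE text merges entries from core-site.xml and hdfs-site.xml into one string purely for illustration, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Illustrative sample only: in a real installation these two properties
# live in separate files (conf/core-site.xml and conf/hdfs-site.xml).
SAMPLE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a dict mapping each property name to its value (as strings)."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name").strip()
        value = prop.findtext("value").strip()
        props[name] = value
    return props


def namenode_address(props):
    """Split the fs.default.name URI into (host, port)."""
    uri = urlparse(props["fs.default.name"])
    return uri.hostname, uri.port


if __name__ == "__main__":
    props = read_properties(SAMPLE)
    print(props)
    print(namenode_address(props))
```

A check of this kind could be run on each cloned VM to confirm that every slave points at the same master host and port.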

Configuring MapReduce: In conf/mapred-site.xml: Parameters related to the JobTracker and TaskTrackers are specified here. The host or IP address and the port of the JobTracker are specified. The path on HDFS where the MapReduce framework stores the data files used as input to various jobs is also specified in mapred-site.xml. If a scheduler other than the default scheduler is used, that is specified here as well.

mapred-site.xml [5]

    <property>
      <name>mapred.job.tracker</name>
      <value>master:54311</value>
      <description>The host and port that the MapReduce JobTracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
    </property>

Configuring the Capacity Scheduler: The capacity scheduler was built from source by executing ant package. The resulting capacity scheduler jar file was placed in the hadoop/build/contrib/ folder. The Hadoop master node was then configured to use the capacity scheduler instead of the default scheduler.

mapred-site.xml [5]

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
      <description>The scheduler which is to be used by the JobTracker.</description>
    </property>

HADOOP_CLASSPATH [5] in conf/hadoop-env.sh was modified to include the capacity scheduler jar. Then the file mapred-site.xml was modified to include the new property shown above, mapred.jobtracker.taskScheduler, with the value org.apache.hadoop.mapred.CapacityTaskScheduler.

3.2 VM Installation and Configuration

Oracle VM VirtualBox [3], a software package for creating and maintaining VMs, was used for creating the VMs. Oracle VM VirtualBox was installed as an application on an existing host operating system. This host application allows additional operating systems, each known as a guest OS, to be loaded and run, each in its own virtual environment.

First, a VM was defined with 512 MB of base memory and a virtual hard disk image of at most 8 GB. For more flexible storage management, a dynamically allocated virtual disk file was created. Such a file uses space on the physical hard disk only as it fills up. Initially the file is very small and occupies no space for unused virtual disk sectors. It then grows every time a disk sector is written to for the first time, until the drive reaches the maximum capacity chosen. It does not shrink again automatically when space is freed, but it uses host disk space far more economically than a fixed-size disk, which is slightly faster but reserves its full capacity up front. A VM was created with the above specification, and Ubuntu was installed as the guest operating system.

VM Configurations

Network Settings for VM: To establish a connection between the VM and the host machine, the network settings were changed. The bridge-utils, vtun and uml-utilities packages [6] were installed on the host machine. A bridge and a tap device were created on the host machine for the guest to use. Two network adapters were enabled in the VM. Along with the default Network Address Translation (NAT) adapter [6], a bridged adapter was configured to use the tap device created on the host. With these settings an SSH connection was successfully established between the host and the VM.

The process of creating the bridge and the tap device on the host machine was automated. A script was scheduled to run at boot time, so that the bridge and tap device were created automatically at system startup. To extend this to multiple machines, multiple tap devices can be created and each VM can be attached to one device.
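The boot-time script itself is not reproduced in the report. As a rough sketch of what such a script might issue, the Python function below builds the command list for one bridge and one tap device per VM. It assumes the brctl tool from bridge-utils, the tunctl tool from uml-utilities and the ip tool from iproute2; the bridge name br0, the user hduser and the tapN naming scheme are hypothetical. The function only builds and prints the commands, since running them requires root privileges, and it leaves out assigning an IP address to the bridge.

```python
def bridge_setup_commands(bridge, user, vm_count):
    """Build the commands that create one bridge and one tap device per VM.

    The commands are returned as argument lists and are not executed here.
    """
    commands = [["brctl", "addbr", bridge]]            # create the bridge
    for index in range(vm_count):
        tap = f"tap{index}"
        commands.append(["tunctl", "-t", tap, "-u", user])    # tap owned by user
        commands.append(["brctl", "addif", bridge, tap])      # attach tap to bridge
        commands.append(["ip", "link", "set", "dev", tap, "up"])
    commands.append(["ip", "link", "set", "dev", bridge, "up"])
    return commands


if __name__ == "__main__":
    # Hypothetical names: bridge "br0", user "hduser", two VMs on this host.
    for command in bridge_setup_commands("br0", "hduser", 2):
        print(" ".join(command))
```

Each VM's bridged adapter would then be pointed at its own tap device, which matches the one-device-per-VM arrangement described above.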

Hadoop On VM: Sun Java was installed on the VM after adding the necessary repository. SSH was then installed and configured on the VM. The Hadoop source folder was copied to the VM using scp, the secure copy command [5]. First a single-node cluster was run on the VM. After ensuring that Hadoop worked correctly on the VM, it was configured as a slave node in the cluster.

Cloning VM: A virtual image of the Hadoop-configured machine was obtained by cloning the VM [3]. The cloned VM is an exact replica of the machine from which it is cloned. This made the whole process of creating a new Hadoop-configured machine much faster. Because the machine is cloned, the new machine also has the same name and IP address as the first one. In order to distinguish between the nodes in the cluster, the machine names and IP addresses must be distinct. This was fixed by editing the name of the new machine in /etc/hosts and /etc/hostname. A new IP address was assigned to the bridged adapter interface by setting the network parameters in /etc/network/interfaces. Thus a fully configured slave node was created to add to the Hadoop cluster [7].

3.3 Setting Up Virtual Hadoop Cluster

Virtual clusters of varying size were set up. One of the physical machines was configured to be the master node of the cluster. VMs were configured to be the slave nodes of the cluster. The master node was configured on a physical machine so that the NameNode and JobTracker daemons run on a machine with higher capacity. This ensures better performance for the jobs run on the cluster.

In order to ensure scalability of the cluster, it was extended to multiple VMs. Different network configurations were used to establish communication with VM nodes on different physical systems. In addition to the default NAT adapter, a bridged adapter was configured. The bridged adapter was attached to the wireless interface instead of the internal tap device. SSH connections were then established between VMs on different systems as well as with the master node.

After establishing connections between all the nodes of the cluster, MapReduce programs were run. The execution of MapReduce tasks was repeated for different input sizes. Clusters of different sizes were also configured by repeating the above steps. The time taken for each MapReduce job was analyzed. The performance of the virtual cluster was enhanced by making use of the capacity scheduler, which takes the memory available on each node into account. The master node was configured to use the capacity scheduler, and the whole virtual cluster ran MapReduce with capacity scheduling.

Running MapReduce Jobs: The following procedure was followed for running MapReduce [1] tasks. $HADOOP_PATH was set to /usr/local/hadoop.

Formatting the NameNode: The NameNode was formatted prior to starting up the cluster, in case the HDFS size changes or new data comes in.

    Command: $HADOOP_PATH/bin/hadoop namenode -format

Starting the Hadoop daemons: The Hadoop daemons JobTracker, TaskTracker, NameNode, DataNode and Secondary NameNode were started up.

    Command: $HADOOP_PATH/bin/start-all.sh

The daemons can also be started individually on each node.

    Command: $HADOOP_PATH/bin/hadoop-daemon.sh start <daemon-name>

Copying data files: The files on which the MapReduce job is executed are first copied into HDFS.

    Command: $HADOOP_PATH/bin/hadoop dfs -copyFromLocal <local-path> <hdfs-location>

Running MapReduce: Once the data files are in place, the MapReduce job is run.

    Command: $HADOOP_PATH/bin/hadoop <program-name> <input-path> <output-path>
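The command sequence above lends itself to a small driver. The following Python sketch assembles the same commands in order and prints them; it does not execute anything. The function name job_commands, the format_namenode flag and the example paths are assumptions made for illustration. The flag is there because formatting the NameNode is only wanted before the cluster is first started.

```python
import shlex


def job_commands(hadoop_path, local_path, hdfs_path, program, output_path,
                 format_namenode=False):
    """Return the commands for one MapReduce run, in execution order."""
    hadoop = f"{hadoop_path}/bin/hadoop"
    commands = []
    if format_namenode:
        # Only before the first start of the cluster.
        commands.append([hadoop, "namenode", "-format"])
    commands.append([f"{hadoop_path}/bin/start-all.sh"])           # start daemons
    commands.append([hadoop, "dfs", "-copyFromLocal", local_path, hdfs_path])
    commands.append([hadoop, program, hdfs_path, output_path])     # run the job
    return commands


if __name__ == "__main__":
    # Hypothetical paths and program name, for illustration only.
    steps = job_commands("/usr/local/hadoop", "input.txt",
                         "/user/hduser/input", "<program-name>",
                         "/user/hduser/output")
    for step in steps:
        print(" ".join(shlex.quote(part) for part in step))
```

Keeping the commands as argument lists means they could later be passed to subprocess.run without going through a shell.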

Automated Running Through Scripts: Scripts were created to run Hadoop MapReduce, to clone a VM in order to obtain a Hadoop-configured VM image, and to start up a VM and set up its network configuration. The startup of Hadoop in the VM was also automated. A script was created for starting the Hadoop daemons and was scheduled to run at the boot time of the VM. Thus the Hadoop daemons were started automatically when the VM was started.

4 Performance Metrics

Virtual clusters of various sizes were set up, and performance was analyzed for varying input sizes by running MapReduce tasks. The variation in execution time was also observed by varying the cluster size and configuration parameters. The results of the experiments are presented here, along with the experimental data and the graphs obtained.

Comparison of Real and Virtual Clusters: With the number of physical machines remaining the same, real and virtual Hadoop clusters were set up. The real cluster consisted of only the physical machines as nodes. The virtual cluster was set up with VMs on each system along with the physical machine nodes. Thus the virtual cluster had more nodes than the real cluster even though both had the same number of physical machines. The performance of both clusters was analyzed by running MapReduce jobs with inputs of varying size. The experimental data are listed in Table 1 as input file size in MB, time taken to execute the job on the real cluster and time taken to execute it on the virtual cluster.

Input Size (MB)    Real Cluster (sec)    Virtual Cluster (sec)

Table 1: Real vs Virtual Cluster Runtime

The graph in Figure 2 shows the comparison between the performance of a real cluster and a virtualized cluster on a single physical machine. The experiment was repeated for different input sizes, listed in Table 1. The graph shows that better performance was achieved with the virtual Hadoop cluster. This shows that better resource utilization is achieved with virtualization.

Figure 2: Graph for Real Vs Virtual Cluster Runtime

Analysis of execution time with respect to input size: The variation in the time taken for MapReduce tasks was analyzed with different input sizes on a virtual Hadoop cluster. MapReduce jobs were executed for the input sizes in Table 2. The execution time corresponding to each input size is also given in Table 2.

Size of input (MB)    Time taken (sec)

Table 2: Input Size vs Time Taken

The graph in Figure 3 depicts the performance of a virtual Hadoop cluster of three nodes. The variation of execution time for different input sizes is plotted.

Figure 3: Graph for Input Size vs Time Taken

Analysis of Execution Time with respect to Cluster Size: The performance of the virtual cluster was analyzed by varying the cluster size itself. Virtual Hadoop clusters of varying size were set up using multiple machines. On each machine, Hadoop-configured VMs were started. Depending on the system capacity, a varying number of VMs was started on each machine. MapReduce programs were run with the same input size of 6 MB. The experimental data, consisting of the number of nodes in the cluster and the time taken to process an input of 6 MB on the cluster, are listed in Table 3.

Number of Nodes    Time taken (sec)

Table 3: Cluster Size vs Time Taken

The variation of execution time against cluster size is plotted in the graph in Figure 4.

Figure 4: Graph for Cluster Size vs Time Taken

Analysis of performance by varying the number of Reduce tasks: For different input sizes, the number of Reduce tasks was varied and a MapReduce job was executed. The number of reduce tasks is a configurable parameter of Hadoop. The performance was analyzed on the input sizes 3.1, 6, 11.9, 23.8, 47.6, 59.6 and 71 (size in MB) for 1, 2, 3, 4, 6, 8 and 16 Reduce tasks. This parameter can be configured by adding the property mapred.reduce.tasks to the conf/mapred-site.xml file, with the number of reduce tasks given as the value of the property. For each input size, the different numbers of reduce tasks listed above were configured and the MapReduce job was run. The time taken for each job was analyzed. In total all the 42 combinations were tried out in a virtual Hadoop cluster. The results obtained are represented in the graph in Figure 5. The graph represents the variation of time with respect to two parameters, namely the input size and the number of reduce tasks.

Figure 5: Graph for analysis of MapReduce jobs
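Changing the number of reduce tasks between runs means rewriting one property entry each time. As a minimal sketch, assuming the standard name/value property layout shown in Section 3.1, the Python function below renders the mapred.reduce.tasks entry for a given count. The constant REDUCE_TASK_COUNTS repeats the values listed above, and the function name is hypothetical. How the entry is merged into the site file and how the job is restarted are left out.

```python
import xml.etree.ElementTree as ET

# Reduce-task counts tried in the experiment described above.
REDUCE_TASK_COUNTS = [1, 2, 3, 4, 6, 8, 16]


def reduce_tasks_property(count):
    """Render the <property> entry that sets mapred.reduce.tasks to count."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "mapred.reduce.tasks"
    ET.SubElement(prop, "value").text = str(count)
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    for count in REDUCE_TASK_COUNTS:
        print(reduce_tasks_property(count))
```

Generating the entry instead of editing it by hand reduces the chance of a typo invalidating one run of the sweep.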

Scalability of Virtual Hadoop Cluster: The results of the experiment showed that adding more nodes to the cluster decreases the time required to run the wordcount program. After some point, it shows diminishing returns. This suggests that beyond that point adding more nodes to the cluster will not improve the runtime, so the number of virtual nodes on each machine should be decided according to the system capacity. In the context of what Hadoop was designed for, the clusters and data sets used in the experiment are both considered small. Though Hadoop is meant to handle much larger data sets running on clusters with many more nodes, the experiments on the virtual cluster were conducted on machines of relatively small capacity. Given the relatively low power of the machines used in the real cluster, the results were fairly relevant. According to the graph in scenario 2, using 7 nodes to run the wordcount program reduced the runtime by nearly 14 percent compared to using 2 nodes. Assuming that this trend can be achieved in other MapReduce programs, improvements on the same scale can be achieved by setting up virtual clusters instead of running Hadoop jobs entirely on physical machines. The addition of more machines to the cluster leads to an even greater reduction in runtime. The virtual cluster can be scaled up according to the resources available.

Performance Parameters for Real Time Clusters: Various Hadoop configuration parameters [8] that directly affect MapReduce job performance under various conditions were analyzed in order to obtain better performance. The following parameters were analyzed to find the most suitable values for the virtual cluster:

dfs.block.size: The input data is split into blocks before processing. dfs.block.size determines the size of the chunks into which the data is split. Increasing the block size reduces the number of map tasks.

Temporary space: If jobs are large, space is required to store the map output during execution. By default, any intermediate data is stored in temporary space. Increasing the temporary space is advantageous for large jobs.

mapred.local.dir: Any temporary MapReduce data is stored here. More space is advantageous for jobs with large chunks of data.

mapred.map.tasks: Number of map tasks executed for the job. In a cluster, the number of DFS blocks usually determines the number of maps. For example, if dfs.block.size = 256 MB, then for an input size of 160 GB the minimum number of maps = (160*1024)/256 = 640 maps [8]. Best performance is achieved when the number of map tasks is set to a value approximately equal to the number of map task slots in the cluster. It can also be a multiple of the number of map slots available. Network traffic is minimized when tasks are sent to a slot on a machine with a local copy of the data. Setting the number of map tasks to a multiple of the number of nodes ensures this and hence results in faster execution. As a rule of thumb, the number of map tasks can be set to 10 * the number of slaves (i.e., the number of TaskTrackers) [1].

mapred.reduce.tasks: Number of reduce tasks for the job. After the data is sorted, the reduce output is written to the local file system. The write process requests a set of DataNodes that will be used to store the block. If the local host is a DataNode in the file system, the local host will be the first DataNode in the returned set. Such a write to a DataNode on the local host is much faster, as it does not require bulk network traffic. Setting the number of reduce tasks to 2 * the number of slaves [1] (i.e., the number of TaskTrackers) reduces the network traffic and hence results in better performance.

5 Conclusion

In this project virtual clusters of Hadoop were configured and set up. The performance was analyzed and the scalability issues were studied. As the cluster size increased, the runtime continued to decrease. Running multiple VMs put a considerable load on the host computer running the virtualization software. The decrease in time obtained by adding more VMs indicates that the use of virtualization helped in better utilization of the resources of the host computer.

Future Work

The most recent version of Hadoop, released on 27th February 2012, has improved significantly in the fields of HDFS and MapReduce. It also addresses the issue of having multiple masters in a single cluster. Hence, the scalability issue can be dealt with in a better manner in the new version. It might be possible to have cluster sizes ranging in the 1000s with the new version. Setting up a virtual cluster with the latest Hadoop version could bring much better results. More configuration parameters can be analyzed, and the performance of the virtual clusters can be increased by fine-tuning the values of the relevant parameters.

References

[1] Hadoop, The Apache Software Foundation, Dec. [Online]. Available:
[2] R.-E. F. (rafan), Hadoop Capacity Scheduler, Hadoop Taiwan User Group meeting, 2009, Yahoo! Search Engineering.
[3] Oracle VM VirtualBox, User Manual [Accessed: January 19, 2012]. [Online]. Available:
[4] J. Buell, A Benchmarking Case Study of Virtualized Hadoop Performance on VMware vSphere 5, Technical White Paper, VMware, Oct. [Online]. Available: techpaper/vmw-hadoop-performance-vsphere5.pdf
[5] M. G. Noll, Running Hadoop on Ubuntu Linux (Multi-Node Cluster), Aug. 2007, My digital moleskine. [Online]. Available: running-hadoop-on-ubuntu-linux-multi-node-cluster/
[6] C. Macdonald, VirtualBox host to guest networking, Oct. 2009, Callum on life. [Online]. Available: com/2009/10/28/virtualbox-host-to-guest-networking/
[7] Ravindra, Building a Hadoop Cluster using VirtualBox, Xebia IT Architects India Private Limited, Oct. [Online]. Available: building-a-hadoop-cluster-using-virtualbox/
[8] Impetus, HADOOP PERFORMANCE TUNING, White Paper, Impetus Technologies Inc., Oct. 2009, Partners in Software R&D and Engineering. [Online]. Available:
[9] D. de Nadal Bou, Support for managing dynamically Hadoop clusters, in Master in Information Technology - MTI, Sep. 2010, Project Director: Yolanda Becerra. [Online]. Available:
[10] gliffy, online Diagram Software. [Online]. Available: gliffy.com/
[11] J. Devine, Evaluating the Scalability of Hadoop in a Real and Virtual Environment, Dec. 2008, cs380 Final Project. [Online]. Available: jamesdevine.info/wp-content/uploads/2009/03/project.pdf
[12] the Hadooper in me, Research, Labs & IT stuff, Buenos Aires, Nov. [Online]. Available:

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has

More information

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster Setup Hadoop On Ubuntu Linux ---Multi-Node Cluster We have installed the JDK and Hadoop for you. The JAVA_HOME is /usr/lib/jvm/java/jdk1.6.0_22 The Hadoop home is /home/user/hadoop-0.20.2 1. Network Edit

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved. Data Analytics CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL All rights reserved. The data analytics benchmark relies on using the Hadoop MapReduce framework

More information

研 發 專 案 原 始 程 式 碼 安 裝 及 操 作 手 冊. Version 0.1

研 發 專 案 原 始 程 式 碼 安 裝 及 操 作 手 冊. Version 0.1 102 年 度 國 科 會 雲 端 計 算 與 資 訊 安 全 技 術 研 發 專 案 原 始 程 式 碼 安 裝 及 操 作 手 冊 Version 0.1 總 計 畫 名 稱 : 行 動 雲 端 環 境 動 態 群 組 服 務 研 究 與 創 新 應 用 子 計 畫 一 : 行 動 雲 端 群 組 服 務 架 構 與 動 態 群 組 管 理 (NSC 102-2218-E-259-003) 計

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Tutorial for Assignment 2.0

Tutorial for Assignment 2.0 Tutorial for Assignment 2.0 Web Science and Web Technology Summer 2012 Slides based on last years tutorials by Chris Körner, Philipp Singer 1 Review and Motivation Agenda Assignment Information Introduction

More information

TP1: Getting Started with Hadoop

TP1: Getting Started with Hadoop TP1: Getting Started with Hadoop Alexandru Costan MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development of web

More information

Installation and Configuration Documentation

Installation and Configuration Documentation Installation and Configuration Documentation Release 1.0.1 Oshin Prem October 08, 2015 Contents 1 HADOOP INSTALLATION 3 1.1 SINGLE-NODE INSTALLATION................................... 3 1.2 MULTI-NODE

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Hadoop Lab - Setting a 3 node Cluster. http://hadoop.apache.org/releases.html. Java - http://wiki.apache.org/hadoop/hadoopjavaversions

Hadoop Lab - Setting a 3 node Cluster. http://hadoop.apache.org/releases.html. Java - http://wiki.apache.org/hadoop/hadoopjavaversions Hadoop Lab - Setting a 3 node Cluster Packages Hadoop Packages can be downloaded from: http://hadoop.apache.org/releases.html Java - http://wiki.apache.org/hadoop/hadoopjavaversions Note: I have tested

More information

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2. EDUREKA Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.0 Cluster edureka! 11/12/2013 A guide to Install and Configure

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Single Node Setup. Table of contents

Single Node Setup. Table of contents Table of contents 1 Purpose... 2 2 Prerequisites...2 2.1 Supported Platforms...2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster... 3 5 Standalone

More information

1. GridGain In-Memory Accelerator For Hadoop. 2. Hadoop Installation. 2.1 Hadoop 1.x Installation

1. GridGain In-Memory Accelerator For Hadoop. 2. Hadoop Installation. 2.1 Hadoop 1.x Installation 1. GridGain In-Memory Accelerator For Hadoop GridGain's In-Memory Accelerator For Hadoop edition is based on the industry's first high-performance dual-mode in-memory file system that is 100% compatible

More information

Hadoop Installation Tutorial (Hadoop 1.x)

Hadoop Installation Tutorial (Hadoop 1.x) Contents Download and install Java JDK... 1 Download the Hadoop tar ball... 1 Update $HOME/.bashrc... 3 Configuration of Hadoop in Pseudo Distributed Mode... 4 Format the newly created cluster to create

More information

Hadoop (pseudo-distributed) installation and configuration

Hadoop (pseudo-distributed) installation and configuration Hadoop (pseudo-distributed) installation and configuration 1. Operating systems. Linux-based systems are preferred, e.g., Ubuntu or Mac OS X. 2. Install Java. For Linux, you should download JDK 8 under

More information

This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download.

This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download. AWS Starting Hadoop in Distributed Mode This handout describes how to start Hadoop in distributed mode, not the pseudo distributed mode which Hadoop comes preconfigured in as on download. 1) Start up 3

More information

Hadoop in Action. Justin Quan March 15, 2011

Hadoop in Action. Justin Quan March 15, 2011 Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop

More information

Tutorial for Assignment 2.0

Tutorial for Assignment 2.0 Tutorial for Assignment 2.0 Florian Klien & Christian Körner IMPORTANT The presented information has been tested on the following operating systems Mac OS X 10.6 Ubuntu Linux The installation on Windows

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

MapReduce, Hadoop and Amazon AWS

MapReduce, Hadoop and Amazon AWS MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables

More information

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Hadoop Installation. Sandeep Prasad

Hadoop Installation. Sandeep Prasad Hadoop Installation Sandeep Prasad 1 Introduction Hadoop is a system to manage large quantity of data. For this report hadoop- 1.0.3 (Released, May 2012) is used and tested on Ubuntu-12.04. The system

More information

CDH installation & Application Test Report

CDH installation & Application Test Report CDH installation & Application Test Report He Shouchun (SCUID: 00001008350, Email: she@scu.edu) Chapter 1. Prepare the virtual machine... 2 1.1 Download virtual machine software... 2 1.2 Plan the guest

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

HADOOP - MULTI NODE CLUSTER

HADOOP - MULTI NODE CLUSTER HADOOP - MULTI NODE CLUSTER http://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm Copyright tutorialspoint.com This chapter explains the setup of the Hadoop Multi-Node cluster on a distributed

More information

IMPLEMENTING PREDICTIVE ANALYTICS USING HADOOP FOR DOCUMENT CLASSIFICATION ON CRM SYSTEM

IMPLEMENTING PREDICTIVE ANALYTICS USING HADOOP FOR DOCUMENT CLASSIFICATION ON CRM SYSTEM IMPLEMENTING PREDICTIVE ANALYTICS USING HADOOP FOR DOCUMENT CLASSIFICATION ON CRM SYSTEM Sugandha Agarwal 1, Pragya Jain 2 1,2 Department of Computer Science & Engineering ASET, Amity University, Noida,

More information

Single Node Hadoop Cluster Setup

Single Node Hadoop Cluster Setup Single Node Hadoop Cluster Setup This document describes how to create Hadoop Single Node cluster in just 30 Minutes on Amazon EC2 cloud. You will learn following topics. Click Here to watch these steps

More information

Integrating SAP BusinessObjects with Hadoop. Using a multi-node Hadoop Cluster

Integrating SAP BusinessObjects with Hadoop. Using a multi-node Hadoop Cluster Integrating SAP BusinessObjects with Hadoop Using a multi-node Hadoop Cluster May 17, 2013 SAP BO HADOOP INTEGRATION Contents 1. Installing a Single Node Hadoop Server... 2 2. Configuring a Multi-Node

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Running Hadoop On Ubuntu Linux (Multi-Node Cluster) - Michael G...

Running Hadoop On Ubuntu Linux (Multi-Node Cluster) - Michael G... Go Home About Contact Blog Code Publications DMOZ100k06 Photography Running Hadoop On Ubuntu Linux (Multi-Node Cluster) From Michael G. Noll Contents 1 What we want to do 2 Tutorial approach and structure

More information

Hadoop Basics with InfoSphere BigInsights

Hadoop Basics with InfoSphere BigInsights An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Unit 4: Hadoop Administration An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted

More information

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g.

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g. Big Data Computing Instructor: Prof. Irene Finocchi Master's Degree in Computer Science Academic Year 2013-2014, spring semester Installing Hadoop Emanuele Fusco (fusco@di.uniroma1.it) Prerequisites You

More information

A Study of Data Management Technology for Handling Big Data

A Study of Data Management Technology for Handling Big Data Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

Distributed Filesystems

Distributed Filesystems Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls

More information

HDFS Users Guide. Table of contents

HDFS Users Guide. Table of contents Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9

More information

Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters

Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters CONNECT - Lab Guide Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters Hardware, software and configuration steps needed to deploy Apache Hadoop 2.4.1 with the Emulex family

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Set JAVA PATH in Linux Environment. Edit.bashrc and add below 2 lines $vi.bashrc export JAVA_HOME=/usr/lib/jvm/java-7-oracle/

Set JAVA PATH in Linux Environment. Edit.bashrc and add below 2 lines $vi.bashrc export JAVA_HOME=/usr/lib/jvm/java-7-oracle/ Download the Hadoop tar. Download the Java from Oracle - Unpack the Comparisons -- $tar -zxvf hadoop-2.6.0.tar.gz $tar -zxf jdk1.7.0_60.tar.gz Set JAVA PATH in Linux Environment. Edit.bashrc and add below

More information

Hadoop 2.2.0 MultiNode Cluster Setup

Hadoop 2.2.0 MultiNode Cluster Setup Hadoop 2.2.0 MultiNode Cluster Setup Sunil Raiyani Jayam Modi June 7, 2014 Sunil Raiyani Jayam Modi Hadoop 2.2.0 MultiNode Cluster Setup June 7, 2014 1 / 14 Outline 4 Starting Daemons 1 Pre-Requisites

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

2.1 Hadoop a. Hadoop Installation & Configuration

2.1 Hadoop a. Hadoop Installation & Configuration 2. Implementation 2.1 Hadoop a. Hadoop Installation & Configuration First of all, we need to install Java Sun 6, and it is preferred to be version 6 not 7 for running Hadoop. Type the following commands

More information

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: aditya.jadhav27@gmail.com & mr_mahesh_in@yahoo.co.in Abstract : In the information industry,

More information

Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box

Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box By Kavya Mugadur W1014808 1 Table of contents 1.What is CDH? 2. Hadoop Basics 3. Ways to install CDH 4. Installation and

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

Easily parallelize existing application with Hadoop framework Juan Lago, July 2011

Easily parallelize existing application with Hadoop framework Juan Lago, July 2011 Easily parallelize existing application with Hadoop framework Juan Lago, July 2011 There are three ways of installing Hadoop: Standalone (or local) mode: no deamons running. Nothing to configure after

More information

How to install Apache Hadoop 2.6.0 in Ubuntu (Multi node/cluster setup)

How to install Apache Hadoop 2.6.0 in Ubuntu (Multi node/cluster setup) How to install Apache Hadoop 2.6.0 in Ubuntu (Multi node/cluster setup) Author : Vignesh Prajapati Categories : Hadoop Tagged as : bigdata, Hadoop Date : April 20, 2015 As you have reached on this blogpost

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Version 3.0 Please note: This appliance is for testing and educational purposes only; it is unsupported and not

More information

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Elena Burceanu, Irina Presa Automatic Control and Computers Faculty Politehnica University of Bucharest Emails: {elena.burceanu,

More information

Hadoop Distributed File System Propagation Adapter for Nimbus

Hadoop Distributed File System Propagation Adapter for Nimbus University of Victoria Faculty of Engineering Coop Workterm Report Hadoop Distributed File System Propagation Adapter for Nimbus Department of Physics University of Victoria Victoria, BC Matthew Vliet

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters Table of Contents Introduction... Hardware requirements... Recommended Hadoop cluster

More information

Installing Hadoop. Hortonworks Hadoop. April 29, 2015. Mogulla, Deepak Reddy VERSION 1.0

Installing Hadoop. Hortonworks Hadoop. April 29, 2015. Mogulla, Deepak Reddy VERSION 1.0 April 29, 2015 Installing Hadoop Hortonworks Hadoop VERSION 1.0 Mogulla, Deepak Reddy Table of Contents Get Linux platform ready...2 Update Linux...2 Update/install Java:...2 Setup SSH Certificates...3

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Qloud Demonstration 15 319, spring 2010 3 rd Lecture, Jan 19 th Suhail Rehman Time to check out the Qloud! Enough Talk! Time for some Action! Finally you can have your own

More information

HADOOP PERFORMANCE TUNING

HADOOP PERFORMANCE TUNING PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Case-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang

Case-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang Case-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang Independent Study Advanced Case-Based Reasoning Department of Computer Science

More information

Running Kmeans Mapreduce code on Amazon AWS

Running Kmeans Mapreduce code on Amazon AWS Running Kmeans Mapreduce code on Amazon AWS Pseudo Code Input: Dataset D, Number of clusters k Output: Data points with cluster memberships Step 1: for iteration = 1 to MaxIterations do Step 2: Mapper:

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform Page 1 of 16 Table of Contents Table of Contents... 2 Introduction... 3 NoSQL Databases... 3 CumuLogic NoSQL Database Service...

More information

HADOOP MOCK TEST HADOOP MOCK TEST II

HADOOP MOCK TEST HADOOP MOCK TEST II http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray ware 2 Agenda The Hadoop Journey Why Virtualize Hadoop? Elasticity and Scalability Performance Tests Storage Reference

More information

Capacity Scheduler Guide

Capacity Scheduler Guide Table of contents 1 Purpose...2 2 Features... 2 3 Picking a task to run...2 4 Installation...3 5 Configuration... 3 5.1 Using the Capacity Scheduler... 3 5.2 Setting up queues...3 5.3 Configuring properties

More information

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,

More information

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER ANDREAS-LAZAROS GEORGIADIS, SOTIRIOS XYDIS, DIMITRIOS SOUDRIS MICROPROCESSOR AND MICROSYSTEMS LABORATORY ELECTRICAL AND

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Best Practices for Monitoring Databases on VMware. Dean Richards Senior DBA, Confio Software

Best Practices for Monitoring Databases on VMware. Dean Richards Senior DBA, Confio Software Best Practices for Monitoring Databases on VMware Dean Richards Senior DBA, Confio Software 1 Who Am I? 20+ Years in Oracle & SQL Server DBA and Developer Worked for Oracle Consulting Specialize in Performance

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop

More information

HADOOP INTO CLOUD: A RESEARCH

HADOOP INTO CLOUD: A RESEARCH HADOOP INTO CLOUD: A RESEARCH Prof.M.S.Malkar H.O.D, PCP, Pune msmalkar@rediffmail.com Ms.Misha Ann Alexander M.E.Student, LecturerPCP,Pune annalexander33@gmail.com Abstract- Hadoop and cloud has gained

More information

HADOOP CLUSTER SETUP GUIDE:

HADOOP CLUSTER SETUP GUIDE: HADOOP CLUSTER SETUP GUIDE: Passwordless SSH Sessions: Before we start our installation, we have to ensure that passwordless SSH Login is possible to any of the Linux machines of CS120. In order to do

More information

CactoScale Guide User Guide. Athanasios Tsitsipas (UULM), Papazachos Zafeirios (QUB), Sakil Barbhuiya (QUB)

CactoScale Guide User Guide. Athanasios Tsitsipas (UULM), Papazachos Zafeirios (QUB), Sakil Barbhuiya (QUB) CactoScale Guide User Guide Athanasios Tsitsipas (UULM), Papazachos Zafeirios (QUB), Sakil Barbhuiya (QUB) Version History Version Date Change Author 0.1 12/10/2014 Initial version Athanasios Tsitsipas(UULM)

More information

Optimizing Hadoop Parameters Based on the Application Resource Consumption

Optimizing Hadoop Parameters Based on the Application Resource Consumption IT 13 034 Examensarbete 30 hp Maj 2013 Optimizing Hadoop Parameters Based on the Application Resource Consumption Ziad Benslimane Institutionen för informationsteknologi Department of Information Technology

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce. Tushar B. Kute, http://tusharkute.com MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

More information

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,

More information
