© 2014 SEMAR GROUPS TECHNICAL SOCIETY.


ISSN Vol.03,Issue.11 June-2014, Pages: Enhance the Performance of Cloud Computing with Hadoop Dept of CSE, College of Science, Kirkuk University, Ministry of Higher Education & Scientific Research, Iraq. Abstract: Cloud computing started as an option and is slowly becoming a necessity. The cloud provides quick solutions and can be considered cheap compared with alternative solutions, but like anything else the cloud has disadvantages that need to be removed in order to enhance the cloud environment. In the cloud, and especially in IaaS services, the goal is to pay less and gain more; and not only in IaaS: keeping the client's data safe can be the client's first requirement for trusting the cloud, among many others. On the other side there is a technology called Hadoop, which can be considered new in the world of cloud computing. Hadoop relies on a smart strategy: it uses cheap hardware requirements but provides much more, offering fast processing of data despite the cheap environment, more storage than the provided hardware alone would suggest, a technique for protecting data from loss, and more. Hadoop has proved its success with many successful web services such as Twitter, Facebook, etc. In the previous paper, the performance of analyzing data with the help of both Hadoop and the cloud was examined using the basic tools of Hadoop; the results were impressive, showing 86% less time for processed files of big sizes. In this paper another strategy for enhancing the performance of both Hadoop and cloud computing is used; compared with the previous paper, a 98% time improvement for processed files of big sizes was obtained. Keywords: Cloud Computing; Hadoop; Java. I.
INTRODUCTION According to the "National Institute of Standards and Technology" (NIST) in 2013 [10], cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Cloud computing offers three major types of services. SaaS is the model in which an application is hosted as a service for customers who access it via the Internet. PaaS is the second type of cloud computing service; here the service provider is responsible for delivering another type of application, the software resources: the PaaS vendor supplies all the resources required to build applications and services completely via the Internet, without having to download or install software. IaaS simply offers the hardware so that an organization can use it; rather than purchasing servers and racks, the data center can simply be used without building or buying it, because the service provider rents those resources out, and IaaS can be dynamically scaled up or down based on the application's resource needs [8]. Hadoop is an open-source framework that deals with distributed systems and has the ability to analyze and process big data, i.e. files of huge size (terabytes and more). Hadoop consists of two major components: the first is the set of servers responsible for storage, and the second is the technique used to connect the servers inside a Hadoop cluster.
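Both components are detailed below; the overall map-then-reduce flow they implement can be illustrated with a minimal plain-Java simulation (a sketch only; real Hadoop jobs use the org.apache.hadoop API, which is not shown here):

```java
import java.util.*;

// Minimal plain-Java illustration of the map-then-reduce idea:
// map emits (word, 1) pairs, reduce sums the values per key.
class MapReduceSketch {
    // Map phase: emit one (word, 1) pair for every word in the input.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce phase: group the pairs by key and sum their values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // For "cloud hadoop cloud" this prints a map with cloud=2, hadoop=1.
        System.out.println(reduce(map("cloud hadoop cloud")));
    }
}
```

In a real cluster the map and reduce calls run on different TaskTracker nodes and the grouping happens in the shuffle stage; this sketch collapses all of that onto one machine.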
The first part is called HDFS. HDFS is distinguished from other distributed file systems such as NFS: NFS provides one machine to store the clients' files by making that portion visible to them, so if there is a big load on this server from the clients, the server can crash, and the other distributed systems have similar problems. HDFS has the ability to solve the problems of the other distributed systems. HDFS divides the storage role among three major components: the NameNode, the DataNode, and the Secondary NameNode. The NameNode works as an index: first of all, the NameNode receives from the clients the incoming files that need to be stored for later processing, divides the data, and then distributes the partitioned data to the other servers in the Hadoop cluster. The DataNode is the server that stores the divided data coming from the NameNode. Hadoop also provides a backup for the NameNode in case any kind of failure happens: the Secondary NameNode, which works as a backup for the NameNode server. The second component is called MapReduce, the heart of Hadoop. MapReduce is a programming framework, originally created by Google but later developed by Apache Hadoop, consisting of a JobTracker and TaskTrackers: the client contacts the JobTracker and sends a request for processing the data, the JobTracker sends the request to one of the TaskTrackers, and then the processing starts.

II. RELATED WORKS
[1] The researchers in this work merged Hadoop and cloud computing and implemented secure access to the cloud using fingerprint identification and face identification. The Hadoop cloud used in this work was connected to mobile and thin clients over wired or wireless networks, and the master server was connected to the slave servers through the Hadoop cloud. The implemented cloud produced the SaaS, PaaS, and IaaS services; the server operating system was a Linux platform connected via Ethernet, Wi-Fi, or 3G to reach the clients, and the cloud was built using the (JamVM) virtual machine as a platform for J2ME, which served as the Java platform for mobile clients. The results showed that fingerprint identification and face identification were processed within only 2.2 seconds to identify a person. [2] The researchers in this work studied and showed that the performance of VMs can be increased with load balancing across all the VMs in the cloud; they implemented Hadoop among these VMs to get the load-balancing feature. The example cloud they used was the Eucalyptus cloud, and the system they proposed was the EuQoS system. They also tested the proposed system on real-time data, then compared their proposed system with the normal Eucalyptus cloud with Hadoop, and found that their proposed system improved performance by 6.94% compared with Hadoop.
The proposed EuQoS system consists of HDFS, MapReduce, HBase, and load balancing. The MapReduce task is responsible for mapping the jobs, and HDFS is responsible for saving the status of the mapping and producing it to the reduce task. For HDFS they inherited the basic idea of HDFS and created a DFS that contains a namenode, responsible for opening, naming, and closing the files, and a datanode, responsible for the read and write requests and for dealing with the actual files stored in it. In the HBase phase, the EuQoS system is merged with the Eucalyptus cloud: the HMaster server assigns the work to the HRegions, and if an HRegion fails to do the work, the HMaster reassigns the work to another HRegion; the HRegion server is responsible for the reads, the writes, and the requests that come from the HMaster. The last component of the EuQoS system is load balancing, which consists of two components, a load balancer and an agent-based monitor; the load balancer is responsible for three functions: balance triggering, EuQoS scheduling, and VM controlling. As a result, the researchers used IPV for the connection between the VMs and found that the performance of the Eucalyptus cloud with Hadoop was less than that of the EuQoS system.
[3] The researchers in this work compared the PaaS offerings used in the cloud (Hadoop, Dryad, HPCC), and, using cloud computing and Hadoop, also proposed an application handling one of the most complex kinds of data: vision computing. They created a Rhizome cloud for detection and vision computing with the help of Hadoop, so that the data captured from video could be analyzed quickly. They used a representation as input for the cloud; the nodes used in the Hadoop cluster were treated as grains, and the scheduling algorithm used among the nodes was FIFO. Each grain worked independently from the other grains, because Hadoop offers this feature, which increases the ability to protect the whole cloud and ensures that a failure in one of the grains will not affect the other grains. The grains work under a grain manager (the master node), which takes responsibility for watching the work of the other grains; these grains are responsible for vision computing using video capture, motion detection, or even matrix multiplication. The performance of Hadoop with JNI and Rhizome was tested against another application using Hadoop with Rhizome; the input files were two hours of video. JNI with Rhizome took around 80 minutes for the frames, while Hadoop with Rhizome alone took around (31) minutes for the same input frames. In another comparison, using matrix multiplication, high-quality video (1920 x 1080) took 8 M/Sec, while using Hadoop for analyzing the surveillance cameras improved the speed to only a few milliseconds.
[4] In the previous paper an investigation was carried out to improve the performance of Hadoop and cloud computing: the performance of the cloud with the basic tools of Hadoop showed an improvement of 86%. In this paper another investigation is made based on the previous paper, with another scenario built according to the needs of the client: some clients need to increase the tools used, others need to decrease them. To make this investigation fair, the same program was used, but built differently than in the previous scenario. The first part of this paper provides rich material on cloud computing and Hadoop, to understand the characteristics, architecture, and working nature of both cloud computing and Hadoop separately. The second part lists a group of previous works done in recent years and studies them in some detail. The third part lists the objectives of this paper. The fourth part aims to set up a tiny virtual cloud, then set up a single-cluster Hadoop installation on one machine, and build two different cases to investigate the behavior of cloud computing and Hadoop. The fifth part presents the results of processing 72 files of different sizes and contents under the two cases built in the fourth part, followed by a discussion of the results to determine whether there is an improvement from merging cloud computing with Hadoop. The sixth part presents the final conclusion.

III. OBJECTIVE The objective of this paper is to build a tiny virtual cloud and merge Hadoop with it, then build two different kinds of cases, one including the enhanced Hadoop and the cloud together, the other being the cloud by itself without Hadoop, then test the performance of Hadoop and the cloud and compare the obtained results of the enhanced Hadoop with the cloud on one side against the results of the cloud without Hadoop on the other, to see whether performance can be increased or not.

IV. METHODOLOGY First of all, a tiny cloud of normal resource requirements was established and a single-cluster Hadoop installation was made; this cloud includes all the components of Hadoop in the same cloud. In this cloud a virtual data center was built, meaning the master server and the slave server reside in the same cloud. The administrator of the cloud can monitor the performance of the running servers in the virtual data center using two window interfaces that provide complete information about the virtual data center. The health of HDFS and its servers can be monitored using a special site: the administrator can connect to HDFS on port number (50070). The performance of MapReduce can also be monitored, along with every request coming from a client, how the request is served, and how much time each request takes to finish; this job can be accomplished on port number (50030). Figure1. The HDFS of Hadoop. Figure2. The MapReduce of Hadoop. Since Hadoop is built with the Java language, and the second case of the cloud uses only Java, the Java programming language was also installed. The platform used to work with Java, and to connect Java with Hadoop, was Eclipse, one of the platforms used to run the Java language; what makes this platform special is that it is characterized by simplicity and a friendly interface. Two different kinds of scenarios were built to test the performance of Hadoop and cloud computing: the first case tests the performance of Hadoop with cloud computing, and the second case tests the performance of cloud computing by itself, without Hadoop. The performance of these two cases focuses on the ability to process big data, i.e. files of big sizes.
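As a small illustration of the two monitoring endpoints mentioned above, the web UIs can be addressed as follows (a hedged sketch: the hostname is an assumption, while 50070 and 50030 are the default NameNode and JobTracker web ports in Hadoop 1.x):

```java
import java.net.URI;

// Builds the addresses of the two Hadoop monitoring web UIs described in
// the text: HDFS health on port 50070, MapReduce/JobTracker on port 50030.
class MonitorUrls {
    // HDFS (NameNode) health page.
    static URI hdfsUi(String host) {
        return URI.create("http://" + host + ":50070/");
    }

    // MapReduce (JobTracker) status page.
    static URI mapReduceUi(String host) {
        return URI.create("http://" + host + ":50030/");
    }

    public static void main(String[] args) {
        // "localhost" is illustrative; in the single-cluster setup of this
        // paper all Hadoop components run on the same machine.
        System.out.println(hdfsUi("localhost"));
        System.out.println(mapReduceUi("localhost"));
    }
}
```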
A. Case (1) In this case another two different MapReduce programs were created, to test the performance of Hadoop using more tools in one of the programs; the tools used in this case were not used in the previous paper, while in the other program the tools used in the previous paper were reduced. This case was carried out in two ways. The first is building a MapReduce program to test the performance of cloud computing and Hadoop in processing files of word contents. The second is building a MapReduce program to process big files of numeric contents only. The MapReduce programs used in this case are: A- Search Program. B- Maximum Program.

1. Search Program The purpose of this program is to search for a specific word and print how many times it is repeated in a specific file; the entries to this program must be of alphabetic contents only. To ensure that files are processed with this program in an effective way, first of all a particular job is created for this program; the job retrieves the files to be processed from the server where they already reside, and processes them in parts:

a. The Search Algorithm:
  Mapper class
    name = the word to find
    Map function (key, value)
      i = 0
      while there are more words in the file do
        curr = next word in the file
        if (curr = name)
          i = i + 1
      if (i > 0)
        print the name found and the number of appearances

b. Partition Part: in this part, the job requests from the client the word he/she wants to find, then the job starts creating keys and values: the job breaks the contents of the file, taking each word and assigning a key to each word in the file, so each broken word becomes a value and each value takes a particular numeric key. After partitioning the content of the file, the job takes the requested word as a search key and starts reading the intermediate words one by one; whenever the word the client wants to find is matched, the job counts it, until reaching the end of the intermediate outputs. Then the found word and the number of times it was found are printed to a file in the directory on the server that the client requested.

2. Maximum Program This program processes files to extract the maximum number from the whole content of a file and prints the result in a file on the server; the file type this program uses is numeric files only. To ensure that files are processed with this program in an effective way, when the client requests the maximum number of a numeric file, a job is created that asks the client to locate on the server the file whose maximum number is needed, and processes it in parts:

a. The Maximum Algorithm:
  Mapper class
    Map function (key, value)
      while there are more values in the file do
        produce (key, value)
  Combiner class
    Combine function (key, values)
      max = smallest value in the file
      for each value d in the file do
        if (d > max)
          max = d
      produce (key, max)
  Reducer class
    Reduce function (key, values)
      max = smallest value in the file
      for each value d in the file do
        if (d > max)
          max = d
      print the maximum number

b. Partition Part: in this part, the job breaks the contents of the file into a set of values, each taking a numeric key; the numbers are treated as strings and come under the attribute value, and each value has a unique key. The job produces the output set of values with their keys and prepares it for the next part, finding the maximum number.

c. Finding the Maximum Number, Part 1: the job takes the intermediate outputs from the partition part and makes them inputs to this part; in this part it will not work on the same original file, but on the set of values and their keys. The job starts here by finding the minimum number among the intermediate outputs of the previous part, then takes this minimum number as the current maximum and compares all the intermediate values against it; whenever a number greater than it is found, that number is taken as the new maximum. The job thus produces new intermediate values as output, where each found maximum number is produced with its key; as a result the job produces a new set of found maximum numbers with their keys.

d. Finding the Maximum Number, Part 2: the job takes the intermediate outputs from Part 1, where the maximum numbers have mostly been found, but not as a final result, only to reduce the work of the job in the final part. In this part the job again follows the same procedure of finding the maximum number: it searches for the minimum number among the set of found maximum numbers, takes this minimum number as the current maximum, and compares all the intermediate values against it, updating the maximum whenever a greater number is found, until reaching the end of the intermediate numbers. Then the job prints the maximum number in a file and writes it to the directory specified by the client.
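The core logic of the two Case (1) programs described above can be sketched in plain Java as follows (an illustration only; in the paper both are implemented as Hadoop MapReduce jobs, whose API is not reproduced here):

```java
// Plain-Java sketch of the two Case (1) program logics:
// the Search Program counts occurrences of a word, and the Maximum
// Program keeps the largest number seen, mirroring the combine/reduce
// passes described in the algorithm above.
class CaseOneSketch {
    // Search Program logic: count how many times `name` appears in the text.
    static int countOccurrences(String text, String name) {
        int count = 0;
        for (String word : text.split("\\s+")) {
            if (word.equals(name)) count++;
        }
        return count;
    }

    // Maximum Program logic: start from the smallest possible value and
    // replace it whenever a larger number is found.
    static long maximum(long[] values) {
        long max = Long.MIN_VALUE;
        for (long v : values) {
            if (v > max) max = v;
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("cloud hadoop cloud", "cloud")); // 2
        System.out.println(maximum(new long[]{30, 400, 210}));               // 400
    }
}
```

In the Hadoop version this same logic is split across the Mapper, Combiner, and Reducer so that each TaskTracker handles only its partition of the file.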
B. Case (2) In this case the test is based on testing the behavior of cloud computing without using Hadoop, and to keep the test on a consistent basis, the programs used in this testing have the same names but are of course completely different from the programs used in the previous case. Since the tested behavior is based on processing files of numbers as well as files of characters, two programs were created in this case: one of the two programs processes files of different sizes containing only numbers, and the second program processes alphabetic files of different sizes. The programs are: A- Search Program. B- Maximum Program.

1. Search Program The purpose of this program is to search for a specific word and print how many times it is repeated in a specific file; the entries to this program must be of alphabetic contents only. Using normal Java, an ordinary program was created that searches the complete file, finds the particular word, and counts how many times the word is repeated.

2. Maximum Program This program processes files to extract the maximum number from the whole content of a file and prints the result in a file on the server; this program works only with numeric file contents. Using normal Java, an ordinary program was created that finds the maximum number in the complete file: it finds the minimum number in the file, takes this number as the current maximum, compares it with all the content of the file, and then prints out the maximum number found.

V. RESULTS According to the two cases created in the cloud computing environment, and to make the simulation as near as possible to reality, different files of txt type were created in this paper. To test the virtual cloud with and without Hadoop, and the cloud's ability to process the two cases, files of different sizes were used, to make the analysis of cloud computing and Hadoop as fair as possible. Twelve files of different sizes containing only numbers were used, with sizes of (30, 60, 90, 120, 150, 180, 200, 210, 240, 270, 300, 400) MB, and another twelve files of different sizes containing only words, with the same sizes of (30, 60, 90, 120, 150, 180, 200, 210, 240, 270, 300, 400) MB. A. Case (1): Using 24 txt files in total, two different programs were tested in this case under different circumstances; the results obtained from this first case, using Hadoop with the cloud, are shown in Figure3. Figure3: The Case (1) Results. B. Case (2): In the second and last case, using only normal Java without Hadoop, and using 24 txt files in total, the results are shown in Figure4. Figure4: The Case (2) Results.

VI. DISCUSSION A. Maximum Program The behavior of this program was tested using Case (1), in the virtual cloud with the tools of Hadoop, and Case (2), in the virtual cloud without using Hadoop. The processing times in these two cases, using the same txt files with the same sizes, are shown in the table below.

TABLE1. MAXIMUM PROGRAM COMPARISON
Size  Case(1)  Case(2)  Improvement %

For the Maximum program, which gets the maximum number in a file, for the smallest file of size 30 MB the elapsed processing time for the whole file was (21.604) seconds, and for the largest file of size 400 MB it was ( ) seconds; that means for the 30 MB file the elapsed time was 36 seconds, and at most, for the 400 MB file, it was 3 minutes and 57 seconds. The results were taken under different circumstances. According to the table, Case (1), tested in the virtual cloud using more tools in Hadoop, proved a better improvement compared with Case (2), the same program running in the virtual cloud without Hadoop; the improvement ranged between ( %) as a minimum and ( %) as a maximum, so the total average improvement was ( %).

B. Search Program: This program was tested with the two cases: in Case (1) it was tested in the virtual cloud with Hadoop, using only the tools necessary for executing this specific program, and in Case (2) it was tested in the virtual cloud without using Hadoop. The processing times the program took with the 12 txt files in these two cases are:

TABLE2. SEARCH PROGRAM COMPARISON
Size  Case(1)  Case(2)  Improvement %

The search program searched only for a specific word chosen randomly. For the smallest file size of 30 MB, the elapsed time for processing the complete file was (13.348) seconds, and for the largest file size the elapsed time for processing the complete file was (76.17) seconds; so the processing time for the 30 MB file took only 22 seconds, and for the largest file of 400 MB the elapsed time was one minute and 26 seconds. These results were taken under different circumstances. According to the table, the behavior of the search program in Case (1) showed a better improvement than in Case (2); the improvement of Case (1) ranged between ( %) as a minimum and ( %) as a maximum, and the total average improvement of Case (1) compared with Case (2) was ( %). VII.
CONCLUSION In this paper the behavior of the cloud with Hadoop was tested. In the previous paper the improvement result was 86%, so in this paper the tools of Hadoop used were enhanced based on the situation of each program; each case and each file was processed under different circumstances, to test the ability of the cloud to deal with the different requests coming from different clients, with the normal hardware requirements shown in the table below. The cloud and Hadoop in Case (1) proved their ability to take the different requests coming from the clients and serve these requests in an optimum time, without affecting the health of the machine that holds the whole cluster; the maximum time taken by the cloud with Hadoop was approximately 3 minutes and 57 seconds, with files of 400 MB.

TABLE3. THE HARDWARE AND SOFTWARE REQUIREMENTS
1. Hard Disk: 80 GB SCSI
2. Memory: 2 GB
3. Processor: Intel(R) Core M
4. CPU: 2.40 GHz
5. Operating system: CentOS final, 64-bit

The behavior of the cloud alone, without Hadoop, was also tested, dealing with different requests coming from different clients. It was shown that serving each request from the clients made the machine act very slowly, and the processing time increased with the increase in file size, which means the cloud behaves very badly as the number of incoming client requests increases and as the file sizes increase.

VIII. REFERENCES
[1] L. Li and M. Zhang (2011), The Strategy of Mining Association Rule Based on Cloud Computing, International Conference on Business Computing and Global Informatization, pp.
[2] J. Chen, Y. Larosa, P. Yang (2012), Optimal QoS Load Balancing Mechanism for Virtual Machines Scheduling in Eucalyptus Cloud Computing Platform, in the 2nd Baltic Congress on Future Internet Communications, pp.
[3] L. Li and M. Zhang (2013), Rhizome: A Middle-Ware For Cloud Vision Computing Framework, ICSSC IEEE, pp.
[4] D. Raad, S. Singh, A Comparative Analysis Of The Performance Of Cloud Computing With Java And Hadoop, International Journal of Computer Science Engineering and Information Technology Research, pp.
[5] Cloud Computing: A Practical Approach by Anthony T. Velte, Toby J. Velte and Robert Elsenpeter (2010).
[6] Hadoop For Dummies, Special Edition by Robert D. Schneider (2012).
[7] Hadoop in Practice by Alex Holmes (2012).
[8] Hadoop: The Definitive Guide, 3rd edition, by Tom White (2012).
[9] Hadoop in Action by Chuck Lam (2011).
[10] NIST (National Institute of Standards and Technology), The NIST Cloud Computing Standards Roadmap, Special Publication, by Lee Badger, Tim Grance, Robert Patt-Corner and Jeff Voas (2012).

[11] Welcome to Apache Hadoop, apache.org/.
[12] Download Eclipse for Linux, CentOS.
[13] Download Java JDK for CentOS, technetwork/java/javase/downloads/jre7-downloads html.
[14] Download VMPlayer, vmware/downloads.
[15] Download CentOS, centos/6/isos/x86_64/.
[16] Y. Yorozu, M. Hirano, K. Oka, and Y. Tagawa, Electron spectroscopy studies on magneto-optical media and plastic substrate interface, IEEE Transl. J. Magn. Japan, vol. 2, pp., August 1987 [Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982].
[17] M. Young, The Technical Writer's Handbook. Mill Valley, CA: University Science, 1989.


Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Hadoop Distributed File System (HDFS) Overview

Hadoop Distributed File System (HDFS) Overview 2012 coreservlets.com and Dima May Hadoop Distributed File System (HDFS) Overview Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT 1 SEUNGHO HAN, 2 MYOUNGJIN KIM, 3 YUN CUI, 4 SEUNGHYUN SEO, 5 SEUNGBUM SEO, 6 HANKU LEE 1,2,3,4,5 Department

More information

Cloud Federation to Elastically Increase MapReduce Processing Resources

Cloud Federation to Elastically Increase MapReduce Processing Resources Cloud Federation to Elastically Increase MapReduce Processing Resources A.Panarello, A.Celesti, M. Villari, M. Fazio and A. Puliafito {apanarello,acelesti, mfazio, mvillari, apuliafito}@unime.it DICIEAMA,

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

HADOOP MOCK TEST HADOOP MOCK TEST I

HADOOP MOCK TEST HADOOP MOCK TEST I http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM Ramesh Maharjan and Manoj Shakya Department of Computer Science and Engineering Dhulikhel, Kavre, Nepal lazymesh@gmail.com,

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

Student's Awareness of Cloud Computing: Case Study Faculty of Engineering at Aden University, Yemen

Student's Awareness of Cloud Computing: Case Study Faculty of Engineering at Aden University, Yemen Student's Awareness of Cloud Computing: Case Study Faculty of Engineering at Aden University, Yemen Samah Sadeq Ahmed Bagish Department of Information Technology, Faculty of Engineering, Aden University,

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Research Article Hadoop-Based Distributed Sensor Node Management System

Research Article Hadoop-Based Distributed Sensor Node Management System Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Design of Electric Energy Acquisition System on Hadoop

Design of Electric Energy Acquisition System on Hadoop , pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University

More information

Processing of Hadoop using Highly Available NameNode

Processing of Hadoop using Highly Available NameNode Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services

More information

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: aditya.jadhav27@gmail.com & mr_mahesh_in@yahoo.co.in Abstract : In the information industry,

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Performance Analysis of Book Recommendation System on Hadoop Platform

Performance Analysis of Book Recommendation System on Hadoop Platform Performance Analysis of Book Recommendation System on Hadoop Platform Sugandha Bhatia #1, Surbhi Sehgal #2, Seema Sharma #3 Department of Computer Science & Engineering, Amity School of Engineering & Technology,

More information

MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce. Tushar B. Kute, http://tusharkute.com MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2. EDUREKA Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.0 Cluster edureka! 11/12/2013 A guide to Install and Configure

More information

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding

More information

Load Balancing in Cloud Computing using Observer's Algorithm with Dynamic Weight Table

Load Balancing in Cloud Computing using Observer's Algorithm with Dynamic Weight Table Load Balancing in Cloud Computing using Observer's Algorithm with Dynamic Weight Table Anjali Singh M. Tech Scholar (CSE) SKIT Jaipur, 27.anjali01@gmail.com Mahender Kumar Beniwal Reader (CSE & IT), SKIT

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

Telecom Data processing and analysis based on Hadoop

Telecom Data processing and analysis based on Hadoop COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

HADOOP MOCK TEST HADOOP MOCK TEST II

HADOOP MOCK TEST HADOOP MOCK TEST II http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Reallocation and Allocation of Virtual Machines in Cloud Computing Manan D. Shah a, *, Harshad B. Prajapati b

Reallocation and Allocation of Virtual Machines in Cloud Computing Manan D. Shah a, *, Harshad B. Prajapati b Proceedings of International Conference on Emerging Research in Computing, Information, Communication and Applications (ERCICA-14) Reallocation and Allocation of Virtual Machines in Cloud Computing Manan

More information

Analysis of Information Management and Scheduling Technology in Hadoop

Analysis of Information Management and Scheduling Technology in Hadoop Analysis of Information Management and Scheduling Technology in Hadoop Ma Weihua, Zhang Hong, Li Qianmu, Xia Bin School of Computer Science and Technology Nanjing University of Science and Engineering

More information

Sriram Krishnan, Ph.D. sriram@sdsc.edu

Sriram Krishnan, Ph.D. sriram@sdsc.edu Sriram Krishnan, Ph.D. sriram@sdsc.edu (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray ware 2 Agenda The Hadoop Journey Why Virtualize Hadoop? Elasticity and Scalability Performance Tests Storage Reference

More information

Cloud Computing Simulation Using CloudSim

Cloud Computing Simulation Using CloudSim Cloud Computing Simulation Using CloudSim Ranjan Kumar #1, G.Sahoo *2 # Assistant Professor, Computer Science & Engineering, Ranchi University, India Professor & Head, Information Technology, Birla Institute

More information

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has

More information

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing. Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Qloud Demonstration 15 319, spring 2010 3 rd Lecture, Jan 19 th Suhail Rehman Time to check out the Qloud! Enough Talk! Time for some Action! Finally you can have your own

More information

How To Run Apa Hadoop 1.0 On Vsphere Tmt On A Hyperconverged Network On A Virtualized Cluster On A Vspplace Tmter (Vmware) Vspheon Tm (

How To Run Apa Hadoop 1.0 On Vsphere Tmt On A Hyperconverged Network On A Virtualized Cluster On A Vspplace Tmter (Vmware) Vspheon Tm ( Apache Hadoop 1.0 High Availability Solution on VMware vsphere TM Reference Architecture TECHNICAL WHITE PAPER v 1.0 June 2012 Table of Contents Executive Summary... 3 Introduction... 3 Terminology...

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

CURSO: ADMINISTRADOR PARA APACHE HADOOP

CURSO: ADMINISTRADOR PARA APACHE HADOOP CURSO: ADMINISTRADOR PARA APACHE HADOOP TEST DE EJEMPLO DEL EXÁMEN DE CERTIFICACIÓN www.formacionhadoop.com 1 Question: 1 A developer has submitted a long running MapReduce job with wrong data sets. You

More information

Assignment # 1 (Cloud Computing Security)

Assignment # 1 (Cloud Computing Security) Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ.

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ. Hadoop Distributed Filesystem Spring 2015, X. Zhang Fordham Univ. MapReduce Programming Model Split Shuffle Input: a set of [key,value] pairs intermediate [key,value] pairs [k1,v11,v12, ] [k2,v21,v22,

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

BBM467 Data Intensive ApplicaAons

BBM467 Data Intensive ApplicaAons Hace7epe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive ApplicaAons Dr. Fuat Akal akal@hace7epe.edu.tr Problem How do you scale up applicaaons? Run jobs processing 100 s of terabytes

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

EFFICIENT JOB SCHEDULING OF VIRTUAL MACHINES IN CLOUD COMPUTING

EFFICIENT JOB SCHEDULING OF VIRTUAL MACHINES IN CLOUD COMPUTING EFFICIENT JOB SCHEDULING OF VIRTUAL MACHINES IN CLOUD COMPUTING Ranjana Saini 1, Indu 2 M.Tech Scholar, JCDM College of Engineering, CSE Department,Sirsa 1 Assistant Prof., CSE Department, JCDM College

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

NTT DOCOMO Technical Journal. Large-Scale Data Processing Infrastructure for Mobile Spatial Statistics

NTT DOCOMO Technical Journal. Large-Scale Data Processing Infrastructure for Mobile Spatial Statistics Large-scale Distributed Data Processing Big Data Service Platform Mobile Spatial Statistics Supporting Development of Society and Industry Population Estimation Using Mobile Network Statistical Data and

More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

A Cost-Evaluation of MapReduce Applications in the Cloud

A Cost-Evaluation of MapReduce Applications in the Cloud 1/23 A Cost-Evaluation of MapReduce Applications in the Cloud Diana Moise, Alexandra Carpen-Amarie Gabriel Antoniu, Luc Bougé KerData team 2/23 1 MapReduce applications - case study 2 3 4 5 3/23 MapReduce

More information